[ { "output": " BlueData DataTap Setup\n\n\nThis section provides instructions for configuring Driverless AI to work with BlueData DataTap." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``dtap_auth_type``: Selects DTAP authentication." }, { "output": " If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user." }, { "output": " This folder can contain multiple config files. Note: The DTAP config file core-site.xml needs to contain DTap FS configuration, for example:\n\n ::\n\n \n \n fs.dtap.impl\n com.bluedata.hadoop.bdfs.Bdfs\n The FileSystem for BlueData dtap: URIs.\n \n \n\n- ``dtap_key_tab_path``: The path of the principal key tab file." }, { "output": " - ``dtap_app_principal_user``: The Kerberos app principal user (recommended). - ``dtap_app_login_user``: The user ID of the current user (for example, user@realm)." }, { "output": " Separate each argument with spaces. - ``dtap_app_classpath``: The DTap classpath. - ``dtap_init_path``: Specifies the starting DTAP path displayed in the UI of the DTAP browser." }, { "output": " This must be configured in order for data connectors to function properly. Example 1: Enable DataTap with No Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the DataTap data connector and disables authentication." }, { "output": " This lets users reference data stored in DTap directly using the name node address, for example: ``dtap://name.node/datasets/iris.csv`` or ``dtap://name.node/datasets/``." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,dtap\" \\\n -e DRIVERLESS_AI_DTAP_AUTH_TYPE='noauth' \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure DataTap options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, dtap\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the DataTap data connector and disables authentication in the config.toml file." }, { "output": " (Note: The trailing slash is currently required for directories.) 1. Export the Driverless AI config.toml file or add it to ~/.bashrc." }, { "output": " Specify the following configuration options in the config.toml file. 
::\n\n # File System Support\n # upload : standard upload feature\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n enabled_file_systems = \"file, dtap\"\n\n 3." }, { "output": " Example 2: Enable DataTap with Keytab-Based Authentication\n\n\nNotes: \n\n- If using Kerberos authentication, the time on the Driverless AI server must be in sync with the Kerberos server." }, { "output": " - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user; otherwise Driverless AI will not be able to read or access the keytab, will fall back to simple authentication, and will fail." }, { "output": " - Configures the environment variable ``DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER`` to reference a user for whom the keytab was created (usually in the form of user@realm)." }, { "output": " - Configures the option ``dtap_app_principal_user`` to reference a user for whom the keytab was created (usually in the form of user@realm)." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, dtap\"``\n - ``dtap_auth_type = \"keytab\"``\n - ``dtap_key_tab_path = \"/tmp/\"``\n - ``dtap_app_principal_user = \"\"``\n\n 2." }, { "output": " Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # file : local file system/server file system\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n enabled_file_systems = \"file, dtap\"\n\n # Blue Data DTap connector settings are similar to HDFS connector settings." }, { "output": " If running\n # DAI as a service, then the Kerberos keytab needs to\n # be owned by the DAI user." }, { "output": " 3. Save the changes when you are done, then stop/restart Driverless AI. Example 3: Enable DataTap with Keytab-Based Impersonation\n~\n\nNotes: \n\n- If using Kerberos, be sure that the Driverless AI time is synced with the Kerberos server." }, { "output": " .. tabs::\n .. group-tab:: Docker Image Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " - Configures the ``DRIVERLESS_AI_DTAP_APP_LOGIN_USER`` variable, which references a user who is being impersonated (usually in the form of user@realm)." }, { "output": " - Configures the ``dtap_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm)."
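}, { "output": " Before configuring Driverless AI, you can optionally sanity-check the keytab from the command line with the standard Kerberos client tools (the keytab path and principal below are illustrative):\n\n .. code-block:: bash\n\n # List the principals stored in the keytab\n klist -kt /tmp/dai.keytab\n\n # Obtain a ticket with the keytab to confirm that it is readable and valid\n kinit -kt /tmp/dai.keytab user@realm" }, { "output": " 1. 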
Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, dtap\"``\n - ``dtap_auth_type = \"keytabimpersonation\"``\n - ``dtap_key_tab_path = \"/tmp/\"``\n - ``dtap_app_principal_user = \"\"``\n - ``dtap_app_login_user = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " - Configures the ``dtap_app_login_user`` variable, which references a user who is being impersonated (usually in the form of user@realm)." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n \n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, dtap\"\n\n # Blue Data DTap connector settings are similar to HDFS connector settings." }, { "output": " If running\n # DAI as a service, then the Kerberos keytab needs to\n # be owned by the DAI user." }, { "output": " Data Recipe URL Setup\n-\n\nDriverless AI lets you explore data recipe URL data sources from within the Driverless AI application." }, { "output": " When enabled (default), you will be able to modify datasets that have been added to Driverless AI. (Refer to :ref:`modify_by_recipe` for more information.)" }, { "output": " These steps are provided in case this connector was previously disabled and you want to re-enable it." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Enable Data Recipe URL\n\n\n.. tabs::\n .. 
group-tab:: Docker Image Installs\n\n This example enables the data recipe URL data connector." }, { "output": " Note that ``recipe_url`` is enabled in the config.toml file by default. 1. Configure the Driverless AI config.toml file." }, { "output": " - ``enabled_file_systems = \"file, upload, recipe_url\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " Note that ``recipe_url`` is enabled by default. 1. Export the Driverless AI config.toml file or add it to ~/.bashrc." }, { "output": " Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, recipe_url\"\n\n 3." }, { "output": " AutoDoc Settings\n\n\nThis section includes settings that can be used to configure AutoDoc. ``make_autoreport``\n~\n\n.. dropdown:: Make AutoDoc\n\t:open:\n\n\tSpecify whether to create an AutoDoc for the experiment after it has finished running." }, { "output": " ``autodoc_report_name``\n~\n\n.. dropdown:: AutoDoc Name\n\t:open:\n\n\tSpecify a name for the AutoDoc report." }, { "output": " ``autodoc_template``\n\n\n.. dropdown:: AutoDoc Template Location\n\t:open:\n\n\tSpecify a path for the AutoDoc template:\n\n\t- To generate a custom AutoDoc template, specify the full path to your custom template." }, { "output": " ``autodoc_output_type``\n~\n\n.. dropdown:: AutoDoc File Output Type\n\t:open:\n\n\tSpecify the AutoDoc output type." }, { "output": " Choose from the following:\n\n\t- auto (Default)\n\t- md\n\t- docx\n\n``autodoc_max_cm_size``\n~\n\n.. dropdown:: Confusion Matrix Max Number of Classes\n\t:open:\n\n\tSpecify the maximum number of classes in the confusion matrix." }, { "output": " ``autodoc_num_features``\n\n\n.. dropdown:: Number of Top Features to Document\n\t:open:\n\n\tSpecify the number of top features to display in the document." }, { "output": " This is set to 50 by default. ``autodoc_min_relative_importance``\n~\n\n.. dropdown:: Minimum Relative Feature Importance Threshold\n\t:open:\n\n\tSpecify the minimum relative feature importance in order for a feature to be displayed." }, { "output": " This is set to 0.003 by default. ``autodoc_include_permutation_feature_importance``\n\n\n.. dropdown:: Permutation Feature Importance\n\t:open:\n\n\tSpecify whether to compute permutation-based feature importance." 
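}, { "output": " The AutoDoc options described in this section can also be set directly in the config.toml file. The following is a sketch with illustrative values; each key is documented on this page:\n\n ::\n\n make_autoreport = true\n autodoc_report_name = \"report\"\n autodoc_output_type = \"docx\"\n autodoc_num_features = 50\n autodoc_min_relative_importance = 0.003"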
}, { "output": " ``autodoc_feature_importance_num_perm``\n~\n\n.. dropdown:: Number of Permutations for Feature Importance\n\t:open:\n\n\tSpecify the number of permutations to make per feature when computing feature importance." }, { "output": " ``autodoc_feature_importance_scorer``\n~\n\n.. dropdown:: Feature Importance Scorer\n\t:open:\n\n\tSpecify the name of the scorer to be used when calculating feature importance." }, { "output": " ``autodoc_pd_max_rows``\n~\n\n.. dropdown:: PDP Max Number of Rows\n\t:open:\n\n\tSpecify the number of rows for Partial Dependence Plots." }, { "output": " Set this value to -1 to disable the time limit. This is set to 20 seconds by default. ``autodoc_out_of_range``\n\n\n.. dropdown:: PDP Out of Range\n\t:open:\n\n\tSpecify the number of standard deviations outside of the range of a column to include in partial dependence plots." }, { "output": " This is set to 3 by default. ``autodoc_num_rows``\n\n\n.. dropdown:: ICE Number of Rows\n\t:open:\n\n\tSpecify the number of rows to include in PDP and ICE plots if individual rows are not specified." }, { "output": " ``autodoc_population_stability_index``\n\n\n.. dropdown:: Population Stability Index\n\t:open:\n\n\tSpecify whether to include a population stability index if the experiment is a binary classification or regression problem." }, { "output": " ``autodoc_population_stability_index_n_quantiles``\n\n\n.. dropdown:: Population Stability Index Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for the population stability index." }, { "output": " ``autodoc_prediction_stats``\n\n\n.. dropdown:: Prediction Statistics\n\t:open:\n\n\tSpecify whether to include prediction statistics information if the experiment is a binary classification or regression problem." }, { "output": " ``autodoc_prediction_stats_n_quantiles``\n\n\n.. dropdown:: Prediction Statistics Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for prediction statistics." }, { "output": " ``autodoc_response_rate``\n~\n\n.. dropdown:: Response Rates Plot\n\t:open:\n\n\tSpecify whether to include response rates information if the experiment is a binary classification problem." }, { "output": " ``autodoc_response_rate_n_quantiles``\n~\n\n.. dropdown:: Response Rates Plot Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for response rates information." }, { "output": " ``autodoc_gini_plot``\n~\n\n.. dropdown:: Show GINI Plot\n\t:open:\n\n\tSpecify whether to show the GINI plot." }, { "output": " ``autodoc_enable_shapley_values``\n~\n\n.. dropdown:: Enable Shapley Values\n\t:open:\n\n\tSpecify whether to show Shapley values results in the AutoDoc." }, { "output": " ``autodoc_data_summary_col_num``\n\n\n.. dropdown:: Number of Features in Data Summary Table\n\t:open:\n\n\tSpecify the number of features to be shown in the data summary table." }, { "output": " To show all columns, specify any value lower than 1. This is set to -1 by default. ``autodoc_list_all_config_settings``\n\n\n.. dropdown:: List All Config Settings\n\t:open:\n\n\tSpecify whether to show all config settings." }, { "output": " All settings are listed when enabled. This is disabled by default. ``autodoc_keras_summary_line_length``\n~\n\n.. dropdown:: Keras Model Architecture Summary Line Length\n\t:open:\n\n\tSpecify the line length of the Keras model architecture summary." }, { "output": " To use the default line length, set this value to -1 (default). ``autodoc_transformer_architecture_max_lines``\n\n\n.. 
dropdown:: NLP/Image Transformer Architecture Max Lines\n\t:open:\n\n\tSpecify the maximum number of lines shown for advanced transformer architecture in the Feature section." }, { "output": " ``autodoc_full_architecture_in_appendix``\n~\n\n.. dropdown:: Appendix NLP/Image Transformer Architecture\n\t:open:\n\n\tSpecify whether to show the full NLP/Image transformer architecture in the appendix." }, { "output": " ``autodoc_coef_table_appendix_results_table``\n~\n\n.. dropdown:: Full GLM Coefficients Table in the Appendix\n\t:open:\n\n\tSpecify whether to show the full GLM coefficient table(s) in the appendix." }, { "output": " ``autodoc_coef_table_num_models``\n~\n\n.. dropdown:: GLM Coefficient Tables Number of Models\n\t:open:\n\n\tSpecify the number of models for which a GLM coefficients table is shown in the AutoDoc." }, { "output": " Set this value to -1 to show tables for all models. This is set to 1 by default. ``autodoc_coef_table_num_folds``\n\n\n.. dropdown:: GLM Coefficient Tables Number of Folds Per Model\n\t:open:\n\n\tSpecify the number of folds per model for which a GLM coefficients table is shown in the AutoDoc." }, { "output": " ``autodoc_coef_table_num_coef``\n~\n\n.. dropdown:: GLM Coefficient Tables Number of Coefficients\n\t:open:\n\n\tSpecify the number of coefficients to show within a GLM coefficients table in the AutoDoc." }, { "output": " Set this value to -1 to show all coefficients. ``autodoc_coef_table_num_classes``\n\n\n.. dropdown:: GLM Coefficient Tables Number of Classes\n\t:open:\n\n\tSpecify the number of classes to show within a GLM coefficients table in the AutoDoc." }, { "output": " This is set to 9 by default. ``autodoc_num_histogram_plots``\n~\n\n.. dropdown:: Number of Histograms to Show\n\t:open:\n\n\tSpecify the number of top features for which to show histograms." }, { "output": " Snowflake Setup\n- \n\nDriverless AI allows you to explore Snowflake data sources from within the Driverless AI application." }, { "output": " This setup requires you to enable authentication. If you enable Snowflake connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``snowflake_account``: The Snowflake account ID\n- ``snowflake_user``: The username for accessing the Snowflake account\n- ``snowflake_password``: The password for accessing the Snowflake account\n- ``enabled_file_systems``: The file systems you want to enable." }, { "output": " Enable Snowflake with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the Snowflake data connector with authentication by passing the ``account``, ``user``, and ``password`` variables." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, snow\"``\n - ``snowflake_account = \"\"``\n - ``snowflake_user = \"\"``\n - ``snowflake_password = \"\"``\n\n 2." }, { "output": " .. 
code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the Snowflake data connector with authentication by passing the ``account``, ``user``, and ``password`` variables." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, snow\"\n\n # Snowflake Connector credentials\n snowflake_account = \"\"\n snowflake_user = \"\"\n snowflake_password = \"\"\n\n 3." }, { "output": " Adding Datasets Using Snowflake\n \n\nAfter the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " 1. Enter Database: Specify the name of the Snowflake database that you are querying. 2. Enter Warehouse: Specify the name of the Snowflake warehouse that you are querying." }, { "output": " Enter Schema: Specify the schema of the dataset that you are querying. 4. Enter Name for Dataset to Be Saved As: Specify a name for the dataset to be saved as." }, { "output": " 5. Enter Username: (Optional) Specify the username associated with this Snowflake account. This can be left blank if ``snowflake_user`` was specified in the config.toml when starting Driverless AI; otherwise, this field is required." }, { "output": " Enter Password: (Optional) Specify the password associated with this Snowflake account. This can be left blank if ``snowflake_password`` was specified in the config.toml when starting Driverless AI; otherwise, this field is required." }, { "output": " Enter Role: (Optional) Specify your role as designated within Snowflake. 
See https://docs.snowflake.net/manuals/user-guide/security-access-control-overview.html for more information." }, { "output": " 8. Enter Region: (Optional) Specify the region of the warehouse that you are querying. This can be found in the Snowflake-provided URL to access your database (as in ...snowflakecomputing.com)." }, { "output": " 9. Enter File Formatting Parameters: (Optional) Specify any additional parameters for formatting your datasets." }, { "output": " (Note: Use only parameters for ``TYPE = CSV``.) For example, if your dataset includes a text column that contains commas, you can specify a different delimiter using ``FIELD_DELIMITER='character'``." }, { "output": " For example, you might specify the following to load the \"AMAZON_REVIEWS\" dataset:\n\n * Database: UTIL_DB\n * Warehouse: DAI_SNOWFLAKE_TEST\n * Schema: AMAZON_REVIEWS_SCHEMA\n * Query: SELECT * FROM AMAZON_REVIEWS\n * Enter File Formatting Parameters (Optional): FIELD_OPTIONALLY_ENCLOSED_BY = '\"' \n\n In the above example, if the ``FIELD_OPTIONALLY_ENCLOSED_BY`` option is not set, the following row will result in a failure to import the dataset (as the dataset's delimiter is ``,`` by default):\n\n ::\n \n positive, 2012-05-03,Wonderful\\, tasty taffy,0,0,3,5,2012,Thu,0\n\n Note: Numeric columns from Snowflake that have NULL values are sometimes converted to strings (for example, `\\\\ \\\\N`)." }, { "output": " 10. Enter Snowflake Query: Specify the Snowflake query that you want to execute. 11. When you are finished, select the Click to Make Query button to add the dataset." }, { "output": " .. _install-on-windows:\n\nWindows 10\n\n\nThis section describes how to install, start, stop, and upgrade Driverless AI on a Windows 10 machine." }, { "output": " For information on how to obtain a license key for Driverless AI, visit https://h2o.ai/o/try-driverless-ai/." }, { "output": " Overview of Installation on Windows\n~\n\nTo install Driverless AI on Windows, use a Driverless AI Docker image." }, { "output": " - Scoring is not available on Windows. Caution: Installing Driverless AI on Windows 10 is not recommended for serious use." }, { "output": " Environment\n~\n\n+-+-+-+-+\n| Operating System | GPU Support? | Min Mem | Suitable for |\n+=+=+=+=+\n| Windows 10 Pro | No | 16 GB | Experimentation |\n+-+-+-+-+\n| Windows 10 Enterprise | No | 16 GB | Experimentation |\n+-+-+-+-+\n| Windows 10 Education | No | 16 GB | Experimentation |\n+-+-+-+-+\n\nNote: Driverless AI cannot be installed on versions of Windows 10 that do not support Hyper-V." }, { "output": " Docker Image Installation\n~\n\nNotes: \n\n- Be aware that there are known issues with Docker for Windows." }, { "output": " - Consult with your Windows System Admin if:\n\n - Your corporate environment does not allow third-party software installs\n - You are running Windows Defender\n - Your machine is not running with ``Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux``." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same." }, { "output": " Installation Procedure\n\n\n1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/." }, { "output": " 2. Download, install, and run Docker for Windows from https://docs.docker.com/docker-for-windows/install/." }, { "output": " Note that you may have to reboot after installation."
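}, { "output": " After installing Docker, a quick sanity check confirms that the daemon is running and shows how much memory it has been given (the format string below prints the total memory available to Docker, in bytes):\n\n .. code-block:: bash\n\n # Verify that Docker is installed and running\n docker version\n\n # Total memory available to Docker, in bytes\n docker info --format \"{{.MemTotal}}\"" }, { "output": " 3. Before running Driverless AI, you must:\n\n - Enable shared access to the C drive." }, { "output": " - Adjust the amount of memory given to Docker to be at least 10 GB. 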
Driverless AI won\u2019t run at all with less than 10 GB of memory." }, { "output": " You can adjust these settings by clicking on the Docker whale in your taskbar (look for hidden tasks, if necessary), then selecting Settings > Shared Drive and Settings > Advanced as shown in the following screenshots." }, { "output": " (Docker will restart.) Note that if you cannot make changes, stop Docker and then start Docker again by right clicking on the Docker icon on your desktop and selecting Run as Administrator." }, { "output": " Open a PowerShell terminal and set up a directory for the version of Driverless AI on the host machine: \n\n .. code-block:: bash\n :substitutions:\n\n md |VERSION-dir|\n\n5." }, { "output": " Move the downloaded Driverless AI image to your new directory. 6. Change directories to the new directory, then load the image using the following command:\n\n .. code-block:: bash\n :substitutions:\n \n cd |VERSION-dir|\n docker load -i .\\dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " .. code-block:: bash\n\n md data\n md log\n md license\n md tmp\n\n8. Copy data into the /data directory." }, { "output": " 9. Run ``docker images`` to find the image tag. 10. Start the Driverless AI Docker image. Be sure to replace ``path_to_`` below with the entire path to the location of the folders that you created (for example, \"c:/Users/user-name/driverlessai_folder/data\")." }, { "output": " GPU support will not be available. Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini prints a (harmless) warning message." }, { "output": " But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " Add Custom Recipes\n\n\nCustom recipes are Python code snippets that can be uploaded into Driverless AI at runtime like plugins." }, { "output": " If you do not have a custom recipe, you can select from a number of recipes available in the `Recipes for H2O Driverless AI repository `_." }, { "output": " To add a custom recipe to Driverless AI, click Add Custom Recipe and select one of the following options:\n\n- From computer: Add a custom recipe as a Python or ZIP file from your local file system." }, { "output": " - From Bitbucket: Add a custom recipe from a Bitbucket repository. To use this option, your Bitbucket username and password must be provided along with the custom recipe Bitbucket URL." }, { "output": " .. _edit-toml:\n\nEditing the TOML Configuration\n\n\nTo open the built-in TOML configuration editor, click TOML in the :ref:`expert-settings` window." }, { "output": " For example, if you set the Make MOJO scoring pipeline setting in the Experiment tab to Off, then the line ``make_mojo_scoring_pipeline = \"off\"`` is displayed in the TOML editor." }, { "output": " To confirm your changes, click Save. The experiment preview updates to reflect your specified configuration changes." }, { "output": " .. note::\n\tDo not edit the section below the ``[recipe_activation]`` line. This section provides Driverless AI with information about which custom recipes can be used by the experiment." }, { "output": " .. _h2o_drive:\n\n###############\nH2O Drive setup\n###############\n\nH2O Drive is an object-store for `H2O AI Cloud `_." }, { "output": " Note: For more information on the H2O Drive, refer to the `official documentation `_." 
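}, { "output": " Description of Configuration Attributes\n~\n\nA sketch of the relevant config.toml entries is shown below (the scope values are illustrative):\n\n ::\n\n enabled_file_systems = \"file, upload, h2o_drive\"\n h2o_drive_access_token_scopes = \"openid profile\"\n authentication_method = \"oidc\"\n\n- ``enabled_file_systems``: The file systems you want to enable."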
}, { "output": " To enable the Feature Store data connector, ``h2o_drive`` must be added to this list of data sources." }, { "output": " - ``h2o_drive_access_token_scopes``: A space-separated list of OpenID scopes for the access token that are used by the H2O Drive connector." }, { "output": " - ``authentication_method``: The authentication method used by DAI. When enabling the Feature Store data connector, this must be set to OpenID Connect (``authentication_method=\"oidc\"``)." }, { "output": " .. _install-on-macosx:\n\nMac OS X\n\n\nThis section describes how to install, start, stop, and upgrade the Driverless AI Docker image on Mac OS X." }, { "output": " Note: Support for GPUs and MOJOs is not available on Mac OS X. The installation steps assume that you have a license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " Stick to small datasets! For serious use, please use Linux. - Be aware that there are known performance issues with Docker for Mac." }, { "output": " Environment\n~\n\n+-+-+-+-+\n| Operating System | GPU Support? | Min Mem | Suitable for |\n+=+=+=+=+\n| Mac OS X | No | 16 GB | Experimentation |\n+-+-+-+-+\n\nInstalling Driverless AI\n\n\n1." }, { "output": " 2. Download and run Docker for Mac from https://docs.docker.com/docker-for-mac/install. 3. Adjust the amount of memory given to Docker to be at least 10 GB." }, { "output": " You can optionally adjust the number of CPUs given to Docker. You will find the controls by clicking on (Docker Whale)->Preferences->Advanced as shown in the following screenshots." }, { "output": " .. image:: ../images/macosx_docker_menu_bar.png\n :align: center\n\n.. image:: ../images/macosx_docker_advanced_preferences.png\n :align: center\n :height: 507\n :width: 382\n\n4." }, { "output": " More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#namespaces. .. image:: ../images/macosx_docker_filesharing.png\n :align: center\n :scale: 40%\n\n5." }, { "output": " With Docker running, open a Terminal and move the downloaded Driverless AI image to your new directory." }, { "output": " Change directories to the new directory, then load the image using the following command:\n\n .. code-block:: bash\n :substitutions:\n\n cd |VERSION-dir|\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n8." }, { "output": " Optionally copy data into the data directory on the host. The data will be visible inside the Docker container at /data." }, { "output": " 10. Run ``docker images`` to find the image tag. 11. Start the Driverless AI Docker image (still within the new Driverless AI directory)." }, { "output": " Note that GPU support will not be available. Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini prints a (harmless) warning message." }, { "output": " But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " Connect to Driverless AI with your browser at http://localhost:12345. Stopping the Docker Image\n~\n\n.. 
include:: stop-docker.rst\n\nUpgrading the Docker Image\n\n\nThis section provides instructions for upgrading Driverless AI versions that were installed in a Docker container." }, { "output": " WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded when Driverless AI is upgraded." }, { "output": " - Build MOJO pipelines before upgrading. - Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading." }, { "output": " Before upgrading, be sure to run MLI jobs on models that you want to continue to interpret in future releases." }, { "output": " If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be able to build a MOJO pipeline on that model after upgrading." }, { "output": " Note: Stop Driverless AI if it is still running. Upgrade Steps\n'\n\n1. SSH into the IP address of the machine that is running Driverless AI." }, { "output": " Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n # cd into the new directory\n cd |VERSION-dir|\n\n3." }, { "output": " 4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " .. _features-settings:\n\nFeatures Settings\n=\n\n``feature_engineering_effort``\n\n\n.. dropdown:: Feature Engineering Effort\n\t:open:\n\n\tSpecify a value from 0 to 10 for the Driverless AI feature engineering effort." }, { "output": " This value defaults to 5. - 0: Keep only numeric features. Only model tuning during evolution. - 1: Keep only numeric features and frequency-encoded categoricals." }, { "output": " - 2: Similar to 1 but instead just no Text features. Some feature tuning before evolution. - 3: Similar to 5 but only tuning during evolution." }, { "output": " - 4: Similar to 5 but slightly more focused on model tuning. - 5: Balanced feature-model tuning. (Default)\n\t- 6-7: Similar to 5 but slightly more focused on feature engineering." }, { "output": " - 9-10: Similar to 8 but no model tuning during feature evolution. .. _check_distribution_shift:\n\n``check_distribution_shift``\n\n\n.. dropdown:: Data Distribution Shift Detection\n\t:open:\n\n\tSpecify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided)." }, { "output": " Currently, this information is only presented to the user and not acted upon. Shifted features should either be dropped." }, { "output": " Also see :ref:`drop_features_distribution_shift_threshold_auc ` and :ref:`check_distribution_shift_drop `." }, { "output": " This defaults to Auto. Note that Auto for time series experiments turns this feature off. Also see :ref:`drop_features_distribution_shift_threshold_auc ` and :ref:`check_distribution_shift `." }, { "output": " When train and test dataset differ (or train/valid or valid/test) in terms of distribution of data, then a model can be built that tells for each row, whether the row is in train or test." }, { "output": " If this AUC, GINI, or Spearman correlation of the model is above the specified threshold, then Driverless AI will consider it a strong enough shift to drop those features." }, { "output": " .. _check_leakage:\n\n``check_leakage``\n~\n\n.. 
dropdown:: Data Leakage Detection\n\t:open:\n\n\tSpecify whether to check for data leakage for each feature." }, { "output": " This may affect model generalization. Driverless AI runs a model to determine the predictive power of each feature on the target variable." }, { "output": " The models with high AUC (for classification) or R2 score (regression) are reported to the user as potential leaks." }, { "output": " This is set to Auto by default. The equivalent config.toml parameter is ``check_leakage``. Also see :ref:`drop_features_leakage_threshold_auc `\n\n.. _drop_features_leakage_threshold_auc:\n\n``drop_features_leakage_threshold_auc``\n~\n\n.. dropdown:: Data Leakage Detection Dropping AUC/R2 Threshold\n\t:open:\n\n\tIf :ref:`Leakage Detection ` is enabled, specify the threshold for dropping features." }, { "output": " This value defaults to 0.999. The equivalent config.toml parameter is ``drop_features_leakage_threshold_auc``." }, { "output": " This value defaults to 10,000,000. ``max_features_importance``\n~\n\n.. dropdown:: Max. num. features for variable importance\n\t:open:\n\n\tSpecify the maximum number of features to use and show in importance tables." }, { "output": " Higher values can lead to lower performance and larger disk space used for datasets with more than 100k columns." }, { "output": " ``enable_wide_rules``\n~\n\n.. dropdown:: Enable Wide Rules\n\t:open:\n\n\tEnable wide data rules for wide datasets (no. of columns > no. of rows). The default value is \"auto\", which automatically enables the wide rules when the number of columns is detected to be greater than the number of rows." }, { "output": " Enabling wide data rules sets all ``max_cols``, ``max_orig_*col``, and ``fs_orig*`` tomls to large values, and enforces monotonicity to be disabled unless ``monotonicity_constraints_dict`` is set or the default value of ``monotonicity_constraints_interpretability_switch`` is changed." }, { "output": " It also enables the :ref:`Xgboost Random Forest model ` for modeling. To disable wide rules, set ``enable_wide_rules`` to \"off\"." }, { "output": " Also see :ref:`wide_datasets_dai` for a quick model run. ``orig_features_fs_report``\n~\n\n.. dropdown:: Report Permutation Importance on Original Features\n\t:open:\n\n\tSpecify whether Driverless AI reports permutation importance on original features (represented as normalized change in the chosen metric) in logs and the report file." }, { "output": " ``max_rows_fs``\n~\n\n.. dropdown:: Maximum Number of Rows to Perform Permutation-Based Feature Selection\n\t:open:\n\n\tSpecify the maximum number of rows when performing permutation feature importance, reduced by (stratified) random sampling." }, { "output": " ``max_orig_cols_selected``\n\n\n.. dropdown:: Max Number of Original Features Used\n\t:open:\n\n\tSpecify the maximum number of columns to be selected from an existing set of columns using feature selection." }, { "output": " For categorical columns, the selection is based upon how well target encoding (or frequency encoding if not available) on categoricals and numerics treated as categoricals helps." }, { "output": " First, the best [max_orig_cols_selected] features are found through feature selection methods, and then these features are used in feature evolution (to derive other features) and in modeling." }, { "output": " Feature selection is performed on all features when this value is exceeded. This value defaults to 300." }, { "output": " This value defaults to 10,000,000. Additional columns above the specified value add a special individual with original columns reduced."
}, { "output": " Note that this is applicable only to special individuals with original columns reduced. A separate individual in the :ref:`genetic algorithm ` is created by doing feature selection by permutation importance on original features." }, { "output": " ``fs_orig_nonnumeric_cols_selected``\n\n\n.. dropdown:: Number of Original Non-Numeric Features to Trigger Feature Selection Model Type\n\t:open:\n\n\tThe maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all features." }, { "output": " A separate individual in the :ref:`genetic algorithm ` is created by doing feature selection by permutation importance on original features." }, { "output": " ``max_relative_cardinality``\n\n\n.. dropdown:: Max Allowed Fraction of Uniques for Integer and Categorical Columns\n\t:open:\n\n\tSpecify the maximum fraction of unique values for integer and categorical columns." }, { "output": " This value defaults to 0.95. .. _num_as_cat:\n\n``num_as_cat``\n\n\n.. dropdown:: Allow Treating Numerical as Categorical\n\t:open:\n\n\tSpecify whether to allow some numerical features to be treated as categorical features." }, { "output": " The equivalent config.toml parameter is ``num_as_cat``. ``max_int_as_cat_uniques``\n\n\n.. dropdown:: Max Number of Unique Values for Int/Float to be Categoricals\n\t:open:\n\n\tSpecify the number of unique values for integer or real columns to be treated as categoricals." }, { "output": " ``max_fraction_invalid_numeric``\n\n\n.. dropdown:: Max. fraction of numeric values to be non-numeric (and not missing) for a column to still be considered numeric\n\t:open:\n\n\tWhen the fraction of non-numeric (and non-missing) values is less or equal than this value, consider the column numeric." }, { "output": " Note: Replaces non-numeric values with missing values at start of experiment, so some information is lost, but column is now treated as numeric, which can help." }, { "output": " .. _nfeatures_max:\n\n``nfeatures_max``\n~\n\n.. dropdown:: Max Number of Engineered Features\n\t:open:\n\n\tSpecify the maximum number of features to be included per model (and in each model within the final model if an ensemble)." }, { "output": " Final ensemble will exclude any pruned-away features and only train on kept features, but may contain a few new features due to fitting on different data view (e.g." }, { "output": " Final scoring pipeline will exclude any pruned-away features, but may contain a few new features due to fitting on different data view (e.g." }, { "output": " The default value of -1 means no restrictions are applied for this parameter except internally-determined memory and interpretability restrictions." }, { "output": " Otherwise, only mutations of scored individuals will be pruned (until the final model where limits are strictly applied)." }, { "output": " * E.g. to generally limit every iteration to exactly 1 features, one must set ``nfeatures_max`` = ``ngenes_max`` =1 and ``remove_scored_0gain_genes_in_postprocessing_above_interpretability`` = 0, but the genetic algorithm will have a harder time finding good features." }, { "output": " .. _ngenes_max:\n\n``ngenes_max``\n\n\n.. dropdown:: Max Number of Genes\n\t:open:\n\n\tSpecify the maximum number of genes (transformer instances) kept per model (and per each model within the final model for ensembles)." }, { "output": " If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes." 
}, { "output": " A value of -1 means no restrictions except internally-determined memory and interpretability restriction." }, { "output": " ``features_allowed_by_interpretability``\n\n\n.. dropdown:: Limit Features by Interpretability\n\t:open:\n\n\tSpecify whether to limit feature counts with the Interpretability training setting as specified by the ``features_allowed_by_interpretability`` :ref:`config.toml ` setting." }, { "output": " This value defaults to 7. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference." }, { "output": " This value defaults to 0.1. Note: This setting is only enabled when Interpretability is greater than or equal to the value specified by the :ref:`enable-constraints` setting and when the :ref:`constraints-override` setting is not specified." }, { "output": " ``monotonicity_constraints_log_level``\n\n\n.. dropdown:: Control amount of logging when calculating automatic monotonicity constraints (if enabled)\n\t:open:\n\n\tFor models that support monotonicity constraints, and if enabled, show automatically determined monotonicity constraints for each feature going into the model based on its correlation with the target." }, { "output": " 'medium' shows correlation of positively and negatively constraint features. 'high' shows all correlation values." }, { "output": " .. _monotonicity-constraints-drop-low-correlation-features:\n\n``monotonicity_constraints_drop_low_correlation_features``\n\n\n.. dropdown:: Whether to drop features that have no monotonicity constraint applied (e.g., due to low correlation with target)\n\t:open:\n\n\tIf enabled, only monotonic features with +1/-1 constraints will be passed to the model(s), and features without monotonicity constraints (0) will be dropped." }, { "output": " Only active when interpretability >= monotonicity_constraints_interpretability_switch or monotonicity_constraints_dict is provided." }, { "output": " .. _constraints-override:\n\n``monotonicity_constraints_dict``\n\n\n.. dropdown:: Manual Override for Monotonicity Constraints\n\t:open:\n\n\tSpecify a list of features for max_features_importance which monotonicity constraints are applied." }, { "output": " The following is an example of how this list can be specified:\n\n\t::\n\n\t \"{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}\"\n\n\tNote: If a list is not provided, then the automatic correlation-based method is used when monotonicity constraints are enabled at high enough interpretability settings." }, { "output": " .. _max-feature-interaction-depth:\n\n``max_feature_interaction_depth``\n~\n\n.. dropdown:: Max Feature Interaction Depth\n\t:open:\n\n\tSpecify the maximum number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates." }, { "output": " The interaction can take multiple forms (i.e. feature1 + feature2 or feature1 * feature2 + \u2026 featureN)." }, { "output": " The depth of the interaction level (as in \"up to\" how many features may be combined at once to create one single feature) can be specified to control the complexity of the feature engineering process." }, { "output": " This value defaults to 8. Set Max Feature Interaction Depth to 1 to disable any feature interactions ``max_feature_interaction_depth=1``." }, { "output": " To use all features for each transformer, set this to be equal to the number of columns. 
To do a 50/50 sample and a fixed feature interaction depth of :math:`n` features, set this to -:math:`n`." }, { "output": " Target encoding refers to several different feature transformations (primarily focused on categorical data) that aim to represent the feature using information of the actual target variable." }, { "output": " These type of features can be very predictive but are prone to overfitting and require more memory as they need to store mappings of the unique categories and the target values." }, { "output": " The degree to which GINI is inaccurate is also used to perform fold-averaging of look-up tables instead of using global look-up tables." }, { "output": " ``enable_lexilabel_encoding``\n~\n\n.. dropdown:: Enable Lexicographical Label Encoding\n\t:open:\n\n\tSpecify whether to enable lexicographical label encoding." }, { "output": " ``enable_isolation_forest``\n~\n\n.. dropdown:: Enable Isolation Forest Anomaly Score Encoding\n\t:open:\n\n\t`Isolation Forest `__ is useful for identifying anomalies or outliers in data." }, { "output": " This split depends on how long it takes to separate the points. Random partitioning produces noticeably shorter paths for anomalies." }, { "output": " This option lets you specify whether to return the anomaly score of each sample. This is disabled by default." }, { "output": " The default Auto setting is only applicable for small datasets and GLMs. ``isolation_forest_nestimators``\n\n\n.. dropdown:: Number of Estimators for Isolation Forest Encoding\n\t:open:\n\n\tSpecify the number of estimators for `Isolation Forest `__ encoding." }, { "output": " ``drop_constant_columns``\n~\n\n.. dropdown:: Drop Constant Columns\n\t:open:\n\n\tSpecify whether to drop columns with constant values." }, { "output": " ``drop_id_columns``\n~\n\n.. dropdown:: Drop ID Columns\n\t:open:\n\n\tSpecify whether to drop columns that appear to be an ID." }, { "output": " ``no_drop_features``\n\n\n.. dropdown:: Don't Drop Any Columns\n\t:open:\n\n\tSpecify whether to avoid dropping any columns (original or derived)." }, { "output": " .. _features_to_drop:\n\n``cols_to_drop``\n\n\n.. dropdown:: Features to Drop\n\t:open:\n\n\tSpecify which features to drop." }, { "output": " .. _cols_to_force_in:\n\n``cols_to_force_in``\n~\n\n.. dropdown:: Features to always keep or force in, e.g." }, { "output": " Forced-in features are handled by the most interpretable transformers allowed by the experiment options, and they are never removed (even if the model assigns 0 importance to them)." }, { "output": " When this field is left empty (default), Driverless AI automatically searches all columns (either at random or based on which columns have high variable importance)." }, { "output": " This is disabled by default. ``agg_funcs_for_group_by``\n\n\n.. dropdown:: Aggregation Functions (Non-Time-Series) for Group By Operations\n\t:open:\n\n\tSpecify whether to enable aggregation functions to use for group by operations." }, { "output": " Out-of-fold aggregations will result in less overfitting, but they analyze less data in each fold. The default value is 5." }, { "output": " Select from the following:\n\n\t- sample: Sample transformer parameters (Default)\n\t- batched: Perform multiple types of the same transformation together\n\t- full: Perform more types of the same transformation together than the above strategy\n\n``dump_varimp_every_scored_indiv``\n\n\n.. 
dropdown:: Enable Detailed Scored Features Info\n\t:open:\n\n\tSpecify whether to dump every scored individual's variable importance (both derived and original) to a csv/tabulated/json file." }, { "output": " This is disabled by default. ``dump_trans_timings``\n\n\n.. dropdown:: Enable Detailed Logs for Timing and Types of Features Produced\n\t:open:\n\n\tSpecify whether to dump every scored fold's timing and feature info to a timings.txt file." }, { "output": " ``compute_correlation``\n~\n\n.. dropdown:: Compute Correlation Matrix\n\t:open:\n\n\tSpecify whether to compute training, validation, and test correlation matrixes." }, { "output": " Note that this setting is currently a single threaded process that may be slow for experiments with many columns." }, { "output": " ``interaction_finder_gini_rel_improvement_threshold``\n~\n\n.. dropdown:: Required GINI Relative Improvement for Interactions\n\t:open:\n\n\tSpecify the required GINI relative improvement value for the InteractionTransformer." }, { "output": " If the data is noisy and there is no clear signal in interactions, this value can be decreased to return interactions." }, { "output": " ``interaction_finder_return_limit``\n~\n\n.. dropdown:: Number of Transformed Interactions to Make\n\t:open:\n\n\tSpecify the number of transformed interactions to make from generated trial interactions." }, { "output": " This value defaults to 5. .. _enable_rapids_transformers:\n\n``enable_rapids_transformers``\n\n\n.. dropdown:: Whether to enable RAPIDS cuML GPU transformers (no mojo)\n\t:open:\n\n\tSpecify whether to enable GPU-based `RAPIDS cuML `__ transformers." }, { "output": " The equivalent config.toml parameter is ``enable_rapids_transformers`` and the default value is False." }, { "output": " This setting also sets the overall scale for lower interpretability settings. Set this to a lower value if you're content with having many weak features despite choosing high interpretability, or if you see a drop in performance due to the need for weak features." }, { "output": " Delta improvement of score corresponds to original metric minus metric of shuffled feature frame if maximizing metric, and corresponds to negative of such a score difference if minimizing." }, { "output": " Note, if using tree methods, multiple depths may be fitted, in which case regardless of this toml setting, only features that are kept for all depths are kept by feature selection." }, { "output": " .. _linux:\n\nLinux x86_64 Installs\n-\n\nThis section provides installation steps for RPM, deb, and tar installs in Linux x86_64 environments." }, { "output": " Hive Setup\n\n\nDriverless AI lets you explore Hive data sources from within the Driverless AI application." }, { "output": " Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``enabled_file_systems``: The file systems you want to enable." }, { "output": " - ``hive_app_configs``: Configuration for Hive Connector. Inputs are similar to configuring the HDFS connector." }, { "output": " This can have multiple files (e.g. hive-site.xml, hdfs-site.xml, etc.) 
- ``auth_type``: Specify one of ``noauth``, ``keytab``, or ``keytabimpersonation`` for Kerberos authentication\n - ``keytab_path``: Specify the path to Kerberos keytab to use for authentication (this can be ``\"\"`` if using ``auth_type=\"noauth\"``)\n - ``principal_user``: Specify the Kerberos app principal user (required when using ``auth_type=\"keytab\"`` or ``auth_type=\"keytabimpersonation\"``)\n\nNotes:\n\n- With Hive connectors, it is assumed that DAI is running on the edge node." }, { "output": " missing classes, dependencies, authorization errors). - Ensure the core-site.xml file (from e.g Hadoop conf) is also present in the Hive conf with the rest of the files (hive-site.xml, hdfs-site.xml, etc.)." }, { "output": " ``hadoop.proxyuser.hive.hosts`` & ``hadoop.proxyuser.hive.groups``). - If you have tez as the Hive execution engine, make sure that the required tez dependencies (classpaths, jars, etc.)" }, { "output": " Alternatively, you can use internal engines that come with DAI by changing your ``hive.execution.engine`` value in the hive-site.xml file to ``mr`` or ``spark``." }, { "output": " For example:\n \n ::\n\n \"\"\"{\n \"hive_connection_1\": {\n \"hive_conf_path\": \"/path/to/hive/conf\",\n \"auth_type\": \"one of ['noauth', 'keytab',\n 'keytabimpersonation']\",\n \"keytab_path\": \"/path/to/.keytab\",\n \"principal_user\": \"hive/node1.example.com@EXAMPLE.COM\",\n },\n \"hive_connection_2\": {\n \"hive_conf_path\": \"/path/to/hive/conf_2\",\n \"auth_type\": \"one of ['noauth', 'keytab', \n 'keytabimpersonation']\",\n \"keytab_path\": \"/path/to/.keytab\",\n \"principal_user\": \"hive/node2.example.com@EXAMPLE.COM\",\n }\n }\"\"\"\n\n \\ Note: The expected input of ``hive_app_configs`` is a `JSON string `__." }, { "output": " Depending on how the configuration value is applied, different forms of outer quotations may be required." }, { "output": " - Configuration value applied with the config.toml file:\n\n ::\n\n hive_app_configs = \"\"\"{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}\"\"\"\n\n - Configuration value applied with an environment variable:\n\n ::\n\n DRIVERLESS_AI_HIVE_APP_CONFIGS='{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}'\n\n- ``hive_app_jvm_args``: Optionally specify additional Java Virtual Machine (JVM) args for the Hive connector." }, { "output": " Notes:\n\n - If a custom `JAAS configuration file `__ is needed for your Kerberos setup, use ``hive_app_jvm_args`` to specify the appropriate file:\n\n ::\n\n hive_app_jvm_args = \"-Xmx20g -Djava.security.auth.login.config=/etc/dai/jaas.conf\"\n\n Sample ``jaas.conf`` file:\n ::\n\n com.sun.security.jgss.initiate {\n com.sun.security.auth.module.Krb5LoginModule required\n useKeyTab=true\n useTicketCache=false\n principal=\"hive/localhost@EXAMPLE.COM\" [Replace this line]\n doNotPrompt=true\n keyTab=\"/path/to/hive.keytab\" [Replace this line]\n debug=true;\n };\n\n- ``hive_app_classpath``: Optionally specify an alternative classpath for the Hive connector." }, { "output": " This can be done by specifying each environment variable in the ``nvidia-docker run`` command or by editing the configuration options in the config.toml file and then specifying that file in the ``nvidia-docker run`` command." }, { "output": " Start the Driverless AI Docker Image. .. 
}, { "output": " Start the Driverless AI Docker Image. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs,hive\" \\\n -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{\"hive_connection_2\": {\"hive_conf_path\":\"/etc/hadoop/conf\",\n \"auth_type\":\"keytabimpersonation\",\n \"keytab_path\":\"/etc/dai/steam.keytab\",\n \"principal_user\":\"steam/mr-0xg9.0xdata.loc@H2OAI.LOC\"}}' \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -v /path/to/hive/conf:/path/to/hive/conf/in/docker \\\n -v /path/to/hive.keytab:/path/in/docker/hive.keytab \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure Hive options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Enable and configure the Hive connector in the Driverless AI config.toml file. The Hive connector configuration must be a JSON/Dictionary string with multiple keys." }, { "output": " Mount the config.toml file into the Docker container. .. code-block:: bash \n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -v /path/to/hive/conf:/path/to/hive/conf/in/docker \\\n -v /path/to/hive.keytab:/path/in/docker/hive.keytab \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Native Installs\n\n This enables the Hive connector." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\"\n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below.\n # hive: Hive Connector, remember to configure Hive below."
}, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs, s3, hive\"\n\n \n # Configuration for Hive Connector\n # Note that inputs are similar to configuring HDFS connectivity\n # Important keys:\n # * hive_conf_path - path to hive configuration, may have multiple files." }, { "output": " Required when using auth_type `keytab` or `keytabimpersonation`\n # JSON/Dictionary String with multiple keys." }, { "output": " Save the changes when you are done, then stop/restart Driverless AI. Adding Datasets Using Hive\n~\n\nAfter the Hive connector is enabled, you can add datasets by selecting Hive from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " Select the Hive configuration that you want to use. .. figure:: ../images/hive_select_configuration.png\n :alt: Select Hive configuration\n\n2." }, { "output": " - Hive Database: Specify the name of the Hive database that you are querying. - Hadoop Configuration Path: Specify the path to your Hive configuration file." }, { "output": " - Hive Kerberos Principal: Specify the Hive Kerberos principal. This is required if the Hive Authentication Type is keytabimpersonation." }, { "output": " This can be noauth, keytab, or keytabimpersonation. - Enter Name for Dataset to be saved as: Optionally specify a new name for the dataset that you are uploading." }, { "output": " Install on Ubuntu\n-\n\nThis section describes how to install the Driverless AI Docker image on Ubuntu." }, { "output": " Environment\n~\n\n+-+-+-+\n| Operating System | GPUs? | Min Mem |\n+=+=+=+\n| Ubuntu with GPUs | Yes | 64 GB |\n+-+-+-+\n| Ubuntu with CPUs | No | 64 GB |\n+-+-+-+\n\n.. _install-on-ubuntu-with-gpus:\n\nInstall on Ubuntu with GPUs\n~\n\nNote: Driverless AI is supported on Ubuntu 16.04 or later." }, { "output": " Once you are logged in, perform the following steps. 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/." }, { "output": " 2. Install and run Docker on Ubuntu (if not already installed):\n\n .. code-block:: bash\n\n # Install and run Docker on Ubuntu\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n sudo apt-key fingerprint 0EBFCD88\n sudo add-apt-repository \\ \n \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\" \n sudo apt-get update\n sudo apt-get install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md. .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(." }, { "output": " Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver: \n\n .. code-block:: bash\n\n nvidia-smi\n\n5." }, { "output": " Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." 
}, { "output": " Set up the data, log, and license directories on the host machine:\n\n .. code-block:: bash\n\n # Set up the data, log, license, and tmp directories on the host machine (within the new directory)\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n9." }, { "output": " The data will be visible inside the Docker container. 10. Run ``docker images`` to find the image tag." }, { "output": " Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " We recommend ``shm-size=256m`` in docker launch command. But if the user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for the Driverless AI docker command." }, { "output": " .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n12." }, { "output": " This section describes how to install and start the Driverless AI Docker image on Ubuntu. Note that this uses ``docker`` and not ``nvidia-docker``." }, { "output": " Watch the installation video `here `__." }, { "output": " Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following steps." }, { "output": " Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Install and run Docker on Ubuntu (if not already installed):\n\n .. code-block:: bash\n\n # Install and run Docker on Ubuntu\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n sudo apt-key fingerprint 0EBFCD88\n sudo add-apt-repository \\ \n \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n sudo apt-get update\n sudo apt-get install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " At this point, you can copy data into the data directory on the host machine. The data will be visible inside the Docker container." }, { "output": " Run ``docker images`` to find the new image tag. 8. Start the Driverless AI Docker image. Note that GPU support will not be available." }, { "output": " We recommend ``shm-size=256m`` in docker launch command. 
But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " .. _linux-tarsh:\n\nLinux TAR SH\n\n\nThe Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive." }, { "output": " This artifact has the same compatibility matrix as the RPM and DEB packages (combined), it just comes packaged slightly differently." }, { "output": " The installation steps assume that you have a valid license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/." }, { "output": " .. note::\n\tTo ensure that :ref:`AutoDoc ` pipeline visualizations are generated correctly on native installations, installing `fontconfig `_ is recommended." }, { "output": " Note that if you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02\n- OpenCL (Required for full LightGBM support on GPU-powered systems)\n- Driverless AI TAR SH, available from https://www.h2o.ai/download/\n\nNote: CUDA 11.2.2 (for GPUs) and cuDNN (required for TensorFlow support on GPUs) are included in the Driverless AI package." }, { "output": " To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used." }, { "output": " Installing Driverless AI\n\n\nRun the following commands to install the Driverless AI TAR SH. .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI." }, { "output": " Starting Driverless AI\n\n\n.. code-block:: bash\n \n # Start Driverless AI. ./run-dai.sh\n\nStarting NVIDIA Persistence Mode\n\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command." }, { "output": " For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\nInstall OpenCL\n\n\nOpenCL is required in order to run LightGBM on GPUs." }, { "output": " .. code-block:: bash\n\n yum -y clean all\n yum -y makecache\n yum -y update\n wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm\n wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm\n rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm\n rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm\n clinfo\n\n mkdir -p /etc/OpenCL/vendors && \\\n echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd\n\nLooking at Driverless AI log files\n\n\n.. code-block:: bash\n\n less log/dai.log\n less log/h2o.log\n less log/procsy.log\n less log/vis-server.log\n\nStopping Driverless AI\n\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " By default, all files for Driverless AI are contained within this directory. Upgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers." }, { "output": " .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\n1." 
}, { "output": " 2. Run the self-extracting archive for the new version of Driverless AI. 3. Port any previous changes you made to your config.toml file to the newly unpacked directory." }, { "output": " Copy the tmp directory (which contains all the Driverless AI working state) from your previous Driverless AI installation into the newly unpacked directory." }, { "output": " Experiment Settings\n=\n\nThis section includes settings that can be used to customize the experiment, such as total runtime, reproducibility level, pipeline building, feature brain control, adding config.toml settings, and more." }, { "output": " This is equivalent to pushing the Finish button once half of the specified time value has elapsed. Note that the overall enforced runtime is only an approximation." }, { "output": " The Finish button will be automatically selected once 12 hours have elapsed, and Driverless AI will subsequently attempt to complete the overall experiment in the remaining 12 hours." }, { "output": " Note that this setting applies per experiment, so if building leaderboard models (n of them), it will apply to each experiment separately (i.e., the total allowed runtime will be n*24hrs)." }, { "output": " This option preserves experiment artifacts that have been generated for the summary and log zip files while continuing to generate additional artifacts." }, { "output": " Note that this setting applies per experiment, so if building leaderboard models (say n), it will apply to each experiment separately (i.e., the total allowed runtime will be n*7days)." }, { "output": " Also see :ref:`time_abort `. .. _time_abort:\n\n``time_abort``\n\n\n.. dropdown:: Time to Trigger the 'Abort' Button\n\t:open:\n\n\tIf the experiment is not done by this time, push the abort button." }, { "output": " Also see :ref:`max_runtime_minutes_until_abort ` for control over per experiment abort times." }, { "output": " Users can also specify integer seconds since 1970-01-01 00:00:00 UTC. This will apply to the time on a DAI worker that runs the experiments." }, { "output": " If the user clones this experiment to rerun/refit/restart, this absolute time will apply to such experiments or to the set of leaderboard experiments." }, { "output": " Select from the following:\n\n\t- Auto: Specifies that all models and features are automatically determined by experiment settings, config.toml settings, and the feature engineering effort." }, { "output": " - Only uses GLM or booster as 'gblinear'. - :ref:`Fixed ensemble level ` is set to 0." }, { "output": " - Max feature interaction depth is set to 1, i.e., no interactions. - The target transformer is set to 'identity' for regression." }, { "output": " - :ref:`monotonicity_constraints_correlation_threshold ` is set to 0." }, { "output": " - Drops features that are not correlated with the target by at least 0.01. See :ref:`monotonicity-constraints-drop-low-correlation-features ` and :ref:`monotonicity-constraints-correlation-threshold `." }, { "output": " - :ref:`Interaction depth ` is set to 1, i.e., no multi-feature interactions are done, to avoid complexity." }, { "output": " The equivalent config.toml parameter is ``recipe=['monotonic_gbm']``. - :ref:`num_as_cat ` feature transformation is disabled." }, { "output": " - Kaggle: Similar to Auto except for the following:\n\n\t\t- Any external validation set is concatenated with the train set, with the target marked as missing." }, { "output": " - Several config.toml expert options have their limits opened up. - nlp_model: Only enable NLP BERT models based on PyTorch to process pure text." 
}, { "output": " For more information, see :ref:`nlp-in-dai`. - included_models = ['TextBERTModel', 'TextMultilingualBERTModel', 'TextXLNETModel', 'TextXLMModel','TextRoBERTaModel', 'TextDistilBERTModel', 'TextALBERTModel', 'TextCamemBERTModel', 'TextXLMRobertaModel']\n\t\t- enable_pytorch_nlp_transformer = 'off'\n\t\t- enable_pytorch_nlp_model = 'on'\n\n\t- nlp_transformer: Only enable PyTorch based BERT transformers that process pure text." }, { "output": " For more information, see :ref:`nlp-in-dai`. - included_transformers = ['BERTTransformer']\n\t\t- excluded_models = ['TextBERTModel', 'TextMultilingualBERTModel', 'TextXLNETModel', 'TextXLMModel','TextRoBERTaModel', 'TextDistilBERTModel', 'TextALBERTModel', 'TextCamemBERTModel', 'TextXLMRobertaModel']\n\t\t- enable_pytorch_nlp_transformer = 'on'\n\t\t- enable_pytorch_nlp_model = 'off'\n\n\t- image_model: Only enable image models that process pure images (ImageAutoModel)." }, { "output": " For more information, see :ref:`image-model`. Notes:\n\n \t\t- This option disables the :ref:`Genetic Algorithm ` (GA)." }, { "output": " - image_transformer: Only enable the ImageVectorizer transformer, which processes pure images. For more information, see :ref:`image-embeddings`." }, { "output": " :ref:`See ` for reference. - gpus_max: Maximize use of GPUs (e.g. use XGBoost, RAPIDS, Optuna hyperparameter search, etc." }, { "output": " Each pipeline building recipe mode can be chosen, and then fine-tuned using the expert settings. Changing the pipeline building recipe will reset all pipeline building recipe options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline building recipe rules." }, { "output": " To reset recipe behavior, one can switch between 'auto' and the desired mode. This way the new child experiment will use the default settings for the chosen recipe." }, { "output": " This is the same as 'on' unless it is a pure NLP or Image experiment. - on: Driverless AI genetic algorithm is used for feature engineering and model tuning and selection." }, { "output": " In the Optuna case, the scores shown in the iteration panel are the best score and trial scores. Optuna mode currently only uses Optuna for XGBoost, LightGBM, and CatBoost (custom recipe)." }, { "output": " - off: When set to 'off', the final pipeline is trained using the default feature engineering and feature selection." }, { "output": " .. _tournament_style:\n\n``tournament_style``\n\n\n.. dropdown:: Tournament Model for Genetic Algorithm\n\t:open:\n\n\tSelect a method to decide which models are best at each iteration." }, { "output": " Choose from the following:\n\n\t- auto: Choose based upon accuracy and interpretability\n\t- uniform: all individuals in population compete to win as best (can lead to all, e.g." }, { "output": " If ``enable_genetic_algorithm='Optuna'``, then every individual is self-mutated without any tournament during the :ref:`genetic algorithm `." }, { "output": " ``make_python_scoring_pipeline``\n\n\n.. dropdown:: Make Python Scoring Pipeline\n\t:open:\n\n\tSpecify whether to automatically build a Python Scoring Pipeline for the experiment." }, { "output": " Select Off to disable the automatic creation of the Python Scoring Pipeline. ``make_mojo_scoring_pipeline``\n\n\n.. dropdown:: Make MOJO Scoring Pipeline\n\t:open:\n\n\tSpecify whether to automatically build a MOJO (Java) Scoring Pipeline for the experiment." 
}, { "output": " With this option, any capabilities that prevent the creation of the pipeline are dropped. Select Off to disable the automatic creation of the MOJO Scoring Pipeline." }, { "output": " ``mojo_for_predictions``\n\n\n.. dropdown:: Allow Use of MOJO for Making Predictions\n\t:open:\n\n\tSpecify whether to use MOJO for making fast, low-latency predictions after the experiment has finished." }, { "output": " .. _reduce_mojo_size:\n\n``reduce_mojo_size``\n~\n.. dropdown:: Attempt to Reduce the Size of the MOJO (Small MOJO)\n\t:open:\n\n\tSpecify whether to attempt to create a small MOJO scoring pipeline when the experiment is being built." }, { "output": " This setting attempts to reduce the mojo size by limiting experiment's maximum :ref:`interaction depth ` to 3, setting :ref:`ensemble level ` to 0 i.e no ensemble model for final pipeline and limiting the :ref:`maximum number of features ` in the model to 200." }, { "output": " This is disabled by default. The equivalent config.toml setting is ``reduce_mojo_size``\n\n``make_pipeline_visualization``\n\n\n.. dropdown:: Make Pipeline Visualization\n\t:open:\n\n\tSpecify whether to create a visualization of the scoring pipeline at the end of an experiment." }, { "output": " Note that the Visualize Scoring Pipeline feature is experimental and is not available for deprecated models." }, { "output": " ``benchmark_mojo_latency``\n\n\n.. dropdown:: Measure MOJO Scoring Latency\n\t:open:\n\n\tSpecify whether to measure the MOJO scoring latency at the time of MOJO creation." }, { "output": " In this case, MOJO scoring latency will be measured if the pipeline.mojo file size is less than 100 MB." }, { "output": " If the MOJO creation process times out, a MOJO can still be made from the GUI or the R and Python clients (the timeout constraint is not applied to these)." }, { "output": " ``mojo_building_parallelism``\n~\n\n.. dropdown:: Number of Parallel Workers to Use During MOJO Creation\n\t:open:\n\n\tSpecify the number of parallel workers to use during MOJO creation." }, { "output": " Set this value to -1 (default) to use all physical cores. ``kaggle_username``\n~\n\n.. dropdown:: Kaggle Username\n\t:open:\n\n\tOptionally specify your Kaggle username to enable automatic submission and scoring of test set predictions." }, { "output": " If you don't have a Kaggle account, you can sign up at https://www.kaggle.com. ``kaggle_key``\n\n\n.. dropdown:: Kaggle Key\n\t:open:\n\n\tSpecify your Kaggle API key to enable automatic submission and scoring of test set predictions." }, { "output": " For more information on obtaining Kaggle API credentials, see https://github.com/Kaggle/kaggle-api#api-credentials." }, { "output": " This value defaults to 120 sec. ``min_num_rows``\n\n\n.. dropdown:: Min Number of Rows Needed to Run an Experiment\n\t:open:\n\n\tSpecify the minimum number of rows that a dataset must contain in order to run an experiment." }, { "output": " .. _reproducibility_level:\n\n``reproducibility_level``\n~\n\n.. dropdown:: Reproducibility Level\n\t:open:\n\n\tSpecify one of the following levels of reproducibility." }, { "output": " ``seed``\n\n\n.. dropdown:: Random Seed\n\t:open:\n\n\tSpecify a random seed for the experiment. When a seed is defined and the reproducible button is enabled (not by default), the algorithm will behave deterministically." }, { "output": " Specify whether to enable full cross-validation (multiple folds) during feature evolution as opposed to a single holdout split." 
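}, { "output": " Settings like these can also be pinned from the Python client through ``config_overrides``, in the same style as the client example later in this section. A sketch, assuming an already-connected ``h2o`` client and an uploaded ``train`` dataset as in that example (the values are illustrative, not recommendations):\n\n .. code-block:: python\n\n model = h2o.start_experiment_sync(\n dataset_key=train.key,\n target_col='target',\n is_classification=True,\n accuracy=7,\n time=5,\n interpretability=7,\n config_overrides=\"\"\"\n reproducibility_level=1\n seed=1234\n fixed_num_folds=5\n \"\"\"\n )"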
}, { "output": " ``save_validation_splits``\n\n\n.. dropdown:: Store Internal Validation Split Row Indices\n\t:open:\n\n\tSpecify whether to store internal validation split row indices." }, { "output": " Enable this setting for debugging purposes. This setting is disabled by default. ``max_num_classes``\n~\n\n.. dropdown:: Max Number of Classes for Classification Problems\n\t:open:\n\n\tSpecify the maximum number of classes to allow for a classification problem." }, { "output": " Memory requirements also increase with a higher number of classes. This value defaults to 200. ``max_num_classes_compute_roc``\n~\n\n.. dropdown:: Max Number of Classes to Compute ROC and Confusion Matrix for Classification Problems\n\n\tSpecify the maximum number of classes to use when computing the ROC and CM." }, { "output": " This value defaults to 200 and cannot be lower than 2. ``max_num_classes_client_and_gui``\n\n\n.. dropdown:: Max Number of Classes to Show in GUI for Confusion Matrix\n\t:open:\n\n\tSpecify the maximum number of classes to show in the GUI for CM, showing first ``max_num_classes_client_and_gui`` labels." }, { "output": " Note that if this value is changed in the config.toml and the server is restarted, then this setting will only modify client-GUI launched diagnostics." }, { "output": " ``roc_reduce_type``\n~\n\n.. dropdown:: ROC/CM Reduction Technique for Large Class Counts\n\t:open:\n\n\tSpecify the ROC confusion matrix reduction technique used for large class counts:\n\n\t- Rows (Default): Reduce by randomly sampling rows\n\t- Classes: Reduce by truncating classes to no more than the value specified by ``max_num_classes_compute_roc``\n\n``max_rows_cm_ga``\n\n\n.. dropdown:: Maximum Number of Rows to Obtain Confusion Matrix Related Plots During Feature Evolution\n\t:open:\n\n\tSpecify the maximum number of rows to obtain confusion matrix related plots during feature evolution." }, { "output": " ``use_feature_brain_new_experiments``\n~\n\n.. dropdown:: Whether to Use Feature Brain for New Experiments\n\t:open:\n\n\tSpecify whether to use feature_brain results even if running new experiments." }, { "output": " Even rescoring may be insufficient, so by default this is False. For example, one experiment may have training=external validation by accident, and get high score, and while feature_brain_reset_score='on' means we will rescore, it will have already seen during training the external validation and leak that data as part of what it learned from." }, { "output": " .. _feature_brain1:\n\n``feature_brain_level``\n~\n\n.. dropdown:: Model/Feature Brain Level\n\t:open:\n\n\tSpecify whether to use H2O.ai brain, which enables local caching and smart re-use (checkpointing) of prior experiments to generate useful features and models for new experiments." }, { "output": " When enabled, this will use the H2O.ai brain cache if the cache file:\n\n\t - has any matching column names and types for a similar experiment type\n\t - has classes that match exactly\n\t - has class labels that match exactly\n\t - has basic time series choices that match\n\t - the interpretability of the cache is equal or lower\n\t - the main model (booster) is allowed by the new experiment\n\n\t- -1: Don't use any brain cache (default)\n\t- 0: Don't use any brain cache but still write to cache." }, { "output": " - 1: Smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model." 
}, { "output": " - 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time series options identically." }, { "output": " - 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size." }, { "output": " - 4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient size." }, { "output": " - 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations to get the best scored individuals." }, { "output": " When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the default maximum brain size is 20GB." }, { "output": " This value defaults to 2. .. _feature_brain2:\n\n``feature_brain2``\n\n\n.. dropdown:: Feature Brain Save Every Which Iteration\n\t:open:\n\n\tSave feature brain iterations every iter_num % feature_brain_iterations_save_every_iteration == 0, to be able to restart/refit with which_iteration_brain >= 0." }, { "output": " - -1: Don't use any brain cache. - 0: Don't use any brain cache but still write to cache. - 1: Smart checkpoint if an old experiment_id is passed in (for example, via running \"resume one like this\" in the GUI)." }, { "output": " (default)\n\t- 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size." }, { "output": " Tune only if the brain population is of insufficient size. - 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations (starting from the resumed experiment if chosen) in order to get the best scored individuals." }, { "output": " In addition, the default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file." }, { "output": " Available options include:\n\n\t- -1: Use the last best\n\t- 1: Run one experiment with feature_brain_iterations_save_every_iteration=1 or some other number\n\t- 2: Identify which iteration brain dump you want to restart/refit from\n\t- 3: Restart/Refit from the original experiment, setting which_iteration_brain to that number here in expert settings." }, { "output": " This value defaults to -1. .. _feature_brain4:\n\n``feature_brain4``\n\n\n.. dropdown:: Feature Brain Refit Uses Same Best Individual\n\t:open:\n\n\tSpecify whether to use the same best individual when performing a refit." }, { "output": " Enabling this setting lets you view the exact same model or feature with only one new feature added." }, { "output": " .. _feature_brain5:\n\n``feature_brain5``\n\n\n.. dropdown:: Feature Brain Adds Features with New Columns Even During Retraining of Final Model\n\t:open:\n\n\tSpecify whether to add additional features from new columns to the pipeline, even when performing a retrain of the final model." }, { "output": " New data may lead to new dropped features due to shift or leak detection. Disable this to avoid adding any columns as new features so that the pipeline is perfectly preserved when changing data." }, { "output": " ``force_model_restart_to_defaults``\n~\n\n.. dropdown:: Restart-Refit Use Default Model Settings If Model Switches\n\t:open:\n\n\tWhen restarting or refitting, specify whether to use the model class's default settings if the original model class is no longer available." }, { "output": " (Note that this may result in errors.) This is enabled by default."
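}, { "output": " To make the save-every-N rule above concrete, a small illustration (the value of ``save_every`` is just an example):\n\n .. code-block:: python\n\n # Iterations where iter_num % feature_brain_iterations_save_every_iteration == 0\n # are written to the brain cache and are valid which_iteration_brain restart points.\n save_every = 3\n saved = [i for i in range(1, 13) if i % save_every == 0]\n print(saved) # [3, 6, 9, 12]"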
}, { "output": " ``min_dai_iterations``\n\n\n.. dropdown:: Min DAI Iterations\n\t:open:\n\n\tSpecify the minimum number of Driverless AI iterations for an experiment." }, { "output": " This value defaults to 0. .. _target_transformer:\n\n``target_transformer``\n\n\n.. dropdown:: Select Target Transformation of the Target for Regression Problems\n\t:open:\n\n\tSpecify whether to automatically select target transformation for regression problems." }, { "output": " Selecting identity_noclip automatically turns off any target transformations. All transformers except for center, standardize, identity_noclip, and log_noclip perform clipping to constrain the predictions to the domain of the target in the training data, so avoid them if you want to enable extrapolations." }, { "output": " ``fixed_num_folds_evolution``\n~\n\n.. dropdown:: Number of Cross-Validation Folds for Feature Evolution\n\t:open:\n\n\tSpecify the fixed number of cross-validation folds (if >= 2) for feature evolution." }, { "output": " This value defaults to -1 (auto). ``fixed_num_folds``\n~\n\n.. dropdown:: Number of Cross-Validation Folds for Final Model\n\t:open:\n\n\tSpecify the fixed number of cross-validation folds (if >= 2) for the final model." }, { "output": " This value defaults to -1 (auto). ``fixed_only_first_fold_model``\n~\n\n.. dropdown:: Force Only First Fold for Models\n\t:open:\n\n\tSpecify whether to force only the first fold for models." }, { "output": " Set \"on\" to force only the first fold for models. This is useful for quick runs regardless of data.\n\n``feature_evolution_data_size``\n~\n\n.. dropdown:: Max Number of Rows Times Number of Columns for Feature Evolution Data Splits\n\t:open:\n\n\tSpecify the maximum number of rows allowed for feature evolution data splits (not for the final pipeline)." }, { "output": " ``final_pipeline_data_size``\n\n\n.. dropdown:: Max Number of Rows Times Number of Columns for Reducing Training Dataset\n\t:open:\n\n\tSpecify the upper limit on the number of rows times the number of columns for training the final pipeline." }, { "output": " ``max_validation_to_training_size_ratio_for_final_ensemble``\n\n\n.. dropdown:: Maximum Size of Validation Data Relative to Training Data\n\t:open:\n\n\tSpecify the maximum size of the validation data relative to the training data." }, { "output": " Note that final model predictions and scores will always be provided on the full dataset provided. This value defaults to 2.0." }, { "output": " If the threshold is not exceeded, random sampling is performed. This value defaults to 0.01. You can choose to always perform random sampling by setting this value to 0, or to always perform stratified sampling by setting this value to 1." }, { "output": " (Refer to the :ref:`sample-configtoml` section to view options that can be overridden during an experiment.)" }, { "output": " Separate multiple config overrides with ``\\n``. For example, the following enables Poisson distribution for LightGBM and disables Target Transformer Tuning." 
}, { "output": " ::\n\n\t params_lightgbm=\\\"{'objective':'poisson'}\\\" \\n target_transformer=identity\n\n\tOr you can specify config overrides similar to the following without having to escape double quotes:\n\n\t::\n\n\t \"\"enable_glm=\"off\" \\n enable_xgboost_gbm=\"off\" \\n enable_lightgbm=\"off\" \\n enable_tensorflow=\"on\"\"\"\n\t \"\"max_cores=10 \\n data_precision=\"float32\" \\n max_rows_feature_evolution=50000000000 \\n ensemble_accuracy_switch=11 \\n feature_engineering_effort=1 \\n target_transformer=\"identity\" \\n tournament_feature_style_accuracy_switch=5 \\n params_tensorflow=\"{'layers': [100, 100, 100, 100, 100, 100]}\"\"\"\n\n\tWhen running the Python client, config overrides would be set as follows:\n\n\t::\n\n\t\tmodel = h2o.start_experiment_sync(\n\t\t dataset_key=train.key,\n\t\t target_col='target',\n\t\t is_classification=True,\n\t\t accuracy=7,\n\t\t time=5,\n\t\t interpretability=1,\n\t\t config_overrides=\"\"\"\n\t\t feature_brain_level=0\n\t\t enable_lightgbm=\"off\"\n\t\t enable_xgboost_gbm=\"off\"\n\t\t enable_ftrl=\"off\"\n\t\t \"\"\"\n\t\t)\n\n``last_recipe``\n~\n\n.. dropdown:: last_recipe\n\t:open:\n\n\tInternal helper that remembers whether the recipe was changed\n\n``feature_brain_reset_score``\n~\n\n.. dropdown:: Whether to re-score models from brain cache\n\t:open:\n\n\tSpecify whether to smartly keep score to avoid re-munging/re-training/re-scoring steps for brain models ('auto'), always force all steps for all brain imports ('on'), or never rescore ('off')." }, { "output": " 'on' is useful when smart similarity checking is not reliable enough. 'off' is useful when you know you want to keep the exact same features and model for the final model refit, despite changes in seed or other behaviors in features that might change the outcome if re-scored before reaching the final model." }, { "output": " You can also set refit_same_best_individual to True if you want the exact same best individual (highest scored model+features) to be used regardless of any scoring changes." }, { "output": " Set to 0 to disable this setting. ``which_iteration_brain``\n~\n\n.. dropdown:: Feature Brain Restart from which iteration\n\t:open:\n\n\tWhen performing restart or re-fit type feature_brain_level with resumed_experiment_id, choose which iteration to start from, instead of only the last best; -1 means just use the last best." }, { "output": " ``refit_same_best_individual``\n\n\n.. dropdown:: Feature Brain refit uses same best individual\n\t:open:\n\n\tWhen doing a re-fit from the feature brain, if columns or features change, the population of individuals used to refit from may change the order of which was best, leading to a better result being chosen (False case)." }, { "output": " That is, if refitting with just 1 extra column and interpretability=1, then the final model will have the same features, with one more engineered feature applied to that new original feature." }, { "output": " However, in other cases, if data and all options are nearly (or exactly) identical, then these steps might change the features slightly (e.g." }, { "output": " By default, restart and refit avoid these steps, assuming data and experiment setup have not changed significantly." }, { "output": " In order to ensure the exact same final pipeline is fitted, one should also set:\n\n\t- 1) brain_add_features_for_new_columns false\n\t- 2) refit_same_best_individual true\n\t- 3) feature_brain_reset_score 'off'\n\t- 4) force_model_restart_to_defaults false\n\n\tThe score will still be reset if the experiment metric chosen changes, but changes to the scored model and features will be more frozen in place."
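}, { "output": " Written as a config-override block in the same style as the Python-client example above, a sketch of those four 'freeze' settings (the surrounding client call is omitted; setting names come from this page):\n\n .. code-block:: python\n\n # Paste this string into the config_overrides argument of\n # h2o.start_experiment_sync when refitting, as in the example above.\n freeze_overrides = \"\"\"\n brain_add_features_for_new_columns=false\n refit_same_best_individual=true\n feature_brain_reset_score='off'\n force_model_restart_to_defaults=false\n \"\"\""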
}, { "output": " In some cases, one might have a new dataset but only want to keep the same pipeline regardless of new columns, in which case one sets this to False." }, { "output": " To avoid a change of feature set, one can disable all dropping of columns, but set this to False to avoid adding any columns as new features, so the pipeline is perfectly preserved when changing data." }, { "output": " If False, then try to keep the original hyperparameters, which can fail to work in general. ``dump_modelparams_every_scored_indiv``\n~\n\n.. dropdown:: Enable detailed scored model info\n\t:open:\n\n\tWhether to dump every scored individual's model parameters to a csv/tabulated/json file; this produces files of type" }, { "output": " [txt, csv, json]\n\n.. _fast-approx-trees:\n\n``fast_approx_num_trees``\n~\n\n.. dropdown:: Max number of trees to use for fast approximation\n\t:open:\n\n\tWhen ``fast_approx=True``, specify the maximum number of trees to use." }, { "output": " .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions." }, { "output": " By default, this setting is enabled. .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions." }, { "output": " By default, this setting is disabled. .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions." }, { "output": " By default, this value is 50. .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc." }, { "output": " By default, this setting is enabled. .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc." }, { "output": " By default, this setting is enabled. .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc." }, { "output": " .. _linux-rpms:\n\nLinux RPMs\n\n\nFor Linux machines that will not use the Docker image or DEB, an RPM installation is available for the following environments:\n\n- x86_64 RHEL 7 / RHEL 8\n- CentOS 7 / CentOS 8\n\nThe installation steps assume that you have a license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " - When using systemd, remove the ``dai-minio``, ``dai-h2o``, ``dai-redis``, ``dai-procsy``, and ``dai-vis-server`` services." }, { "output": " Note that if you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02.\n- OpenCL (Required for full LightGBM support on GPU-powered systems)\n- Driverless AI RPM, available from https://www.h2o.ai/download/\n\nNote: CUDA 11.2.2 (for GPUs) and cuDNN (required for TensorFlow support on GPUs) are included in the Driverless AI package." }, { "output": " To install OpenCL, run the following as root:\n\n.. 
code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used." }, { "output": " Installing Driverless AI\n\n\nRun the following commands to install the Driverless AI RPM. .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI." }, { "output": " You can optionally specify a different service user and group as shown below. Replace and as appropriate." }, { "output": " # rpm saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files. sudo DAI_USER=myuser DAI_GROUP=mygroup rpm -i |VERSION-rpm-lin|\n\nYou may now optionally make changes to /etc/dai/config.toml." }, { "output": " sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Start Driverless AI." }, { "output": " This command needs to be run every reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " sudo systemctl stop dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\nUpgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers." }, { "output": " .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time. # Upgrade and restart." }, { "output": " sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " sudo rpm -U |VERSION-rpm-lin|\n sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nUninstalling Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall. sudo rpm -e dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall. sudo rpm -e dai\n\nCAUTION! At this point you can optionally completely remove all remaining files, including the database." }, { "output": " .. code-block:: bash\n\n sudo rm -rf /opt/h2oai/dai\n sudo rm -rf /etc/dai\n\nNote: The UID and GID are not removed during the uninstall process." }, { "output": " .. _linux-deb:\n\nLinux DEBs\n\n\nFor Linux machines that will not use the Docker image or RPM, a deb installation is available for x86_64 Ubuntu 16.04/18.04/20.04/22.04." }, { "output": " For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/." }, { "output": " .. note::\n\t- To ensure that :ref:`AutoDoc ` pipeline visualizations are generated correctly on native installations, installing `fontconfig `_ is recommended." 
}, { "output": " When upgrading, you can use the following commands to deactivate these services:\n\n ::\n\n systemctl stop dai-minio\n systemctl disable dai-minio\n systemctl stop dai-h2o\n systemctl disable dai-h2o\n systemctl stop dai-redis\n systemctl disable dai-redis\n systemctl stop dai-procsy\n systemctl disable dai-procsy\n systemctl stop dai-vis-server\n systemctl disable dai-vis-server\n\nEnvironment\n~\n\n+-+-+\n| Operating System | Min Mem |\n+=+=+\n| Ubuntu with GPUs | 64 GB |\n+-+-+\n| Ubuntu with CPUs | 64 GB |\n+-+-+\n\nRequirements\n\n\n- Ubuntu 16.04/Ubuntu 18.04/Ubuntu 20.04/Ubuntu 22.04\n- NVIDIA drivers >= |NVIDIA-driver-ver| is recommended (GPU only)." }, { "output": " About the Install\n~\n\n.. include:: linux-rpmdeb-about.frag\n\nStarting NVIDIA Persistence Mode (GPU only)\n~\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command." }, { "output": " For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\nInstalling OpenCL\n~\n\nOpenCL is required for full LightGBM support on GPU-powered systems." }, { "output": " CUDA LightGBM is only supported on Pascal-powered (and later) systems, and can be enabled manually with the ``enable_lightgbm_cuda_support`` config.toml setting." }, { "output": " .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI. sudo dpkg -i |VERSION-deb-lin|\n\nBy default, the Driverless AI processes are owned by the 'dai' user and 'dai' group." }, { "output": " Replace and as appropriate. .. code-block:: bash\n :substitutions:\n\n # Temporarily specify service user and group when installing Driverless AI." }, { "output": " sudo DAI_USER=myuser DAI_GROUP=mygroup dpkg -i |VERSION-deb-lin|\n\nYou may now optionally make changes to /etc/dai/config.toml." }, { "output": " sudo systemctl start dai\n\nNote: If you don't have systemd, refer to :ref:`linux-tarsh` for install instructions." }, { "output": " sudo systemctl stop dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n\nUpgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers." }, { "output": " .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI." }, { "output": " # Upgrade Driverless AI. sudo dpkg -i |VERSION-deb-lin|\n sudo systemctl daemon-reload\n sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time. If you do not, all previous data will be lost." }, { "output": " sudo dpkg -i |VERSION-deb-lin|\n sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nUninstalling Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall Driverless AI. sudo dpkg -r dai\n\n # Purge Driverless AI." 
}, { "output": " sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Uninstall Driverless AI." }, { "output": " sudo dpkg -P dai\n\nCAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot be undone):\n\n.. code-block:: bash\n\n sudo rm -rf /opt/h2oai/dai\n sudo rm -rf /etc/dai\n\nNote: The UID and GID are not removed during the uninstall process." }, { "output": " However, we DO NOT recommend removing the UID and GID if you plan to re-install Driverless AI. If you remove the UID and GID and then reinstall Driverless AI, the UID and GID will likely be re-assigned to a different (unrelated) user/group in the future; this may cause confusion if there are any remaining files on the filesystem referring to the deleted user or group." }, { "output": " This problem is caused by the font ``NotoColorEmoji.ttf``, which cannot be processed by the Python matplotlib library." }, { "output": " (Do not use fontconfig because it is ignored by matplotlib.) The following will print out the command that should be executed." }, { "output": " .. _install-on-nvidia-dgx:\n\nInstall on NVIDIA GPU Cloud/NGC Registry\n\n\nDriverless AI is supported on the following NVIDIA DGX products, and the installation steps for each platform are the same." }, { "output": " Driverless AI is only available in the NGC registry for DGX machines. 1. Log in to your NVIDIA GPU Cloud account at https://ngc.nvidia.com/registry." }, { "output": " 2. In the Registry > Partners menu, select h2oai-driverless. .. image:: ../images/ngc_select_dai.png\n :align: center\n\n3." }, { "output": " .. image:: ../images/ngc_select_tag.png\n :align: center\n\n4. On your NVIDIA DGX machine, open a command prompt and use the specified pull command to retrieve the Driverless AI image." }, { "output": " Set up a directory for the version of Driverless AI on the host machine: \n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n6." }, { "output": " At this point, you can copy data into the data directory on the host machine. The data will be visible inside the Docker container." }, { "output": " Enable persistence of the GPU. Note that this only needs to be run once. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Run ``docker images`` to find the new image tag. 10. Start the Driverless AI Docker image and replace TAG below with the image tag." }, { "output": " Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini will print a (harmless) warning message." }, { "output": " But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. 
code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n11." }, { "output": " Upgrading Driverless AI\n~\n\nThe steps for upgrading Driverless AI on an NVIDIA DGX system are similar to the installation steps." }, { "output": " Requirements\n\n\nAs of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA drivers >= 440.82 installed (GPU only)." }, { "output": " Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series driver." }, { "output": " On your NVIDIA DGX machine, create a directory for the new Driverless AI version. 2. Copy the data, log, license, and tmp directories from the previous Driverless AI directory into the new Driverless AI directory." }, { "output": " Run ``docker pull nvcr.io/h2oai/h2oai-driverless-ai:latest`` to retrieve the latest Driverless AI version." }, { "output": " AWS Role-Based Authentication\n~\n\nIn Driverless AI, it is possible to enable role-based authentication via the `IAM role `__." }, { "output": " AWS IAM Setup\n'\n\n1. Create an IAM role. This IAM role should have a Trust Relationship with Principal Trust Entity set to your Account ID." }, { "output": " Create a new policy that lets users assume the role:\n\n .. image:: ../images/aws_iam_policy_create.png\n\n3." }, { "output": " .. image:: ../images/aws_iam_policy_assign.png\n\n4. Test role switching here: https://signin.aws.amazon.com/switchrole." }, { "output": " Driverless AI Setup\n'\n\nUpdate the ``aws_use_ec2_role_credentials`` config variable in the config.toml file or start Driverless AI using the ``AWS_USE_EC2_ROLE_CREDENTIALS`` environment variable." }, { "output": " Granting a User Permissions to Switch Roles: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_permissions-to-switch.html\n2." }, { "output": " .. _system-settings:\n\nSystem Settings\n=\n\n.. _exclusive_mode:\n\n``exclusive_mode``\n\n\n.. dropdown:: Exclusive level of access to node resources\n\t:open:\n\n\tThere are three levels of access:\n\n\t\t- safe: this level assumes that there might be another experiment also running on same node." }, { "output": " - max: this level assumes that there is absolutly nothing else running on the node except the experiment\n\n\tThe default level is \"safe\" and the equivalent config.toml parameter is ``exclusive_mode``." }, { "output": " Each exclusive mode can be chosen, and then fine-tuned using each expert settings. Changing the exclusive mode will reset all exclusive mode related options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of exclusive mode rules." }, { "output": " To reset mode behavior, one can switch between 'safe' and the desired mode. This way the new child experiment will use the default system resources for the chosen mode." }, { "output": " Note that if you specify 0, all available cores will be used. 
Lower values can reduce memory usage but might slow down the experiment." }, { "output": " One can also set it using the environment variable OMP_NUM_THREADS or OPENBLAS_NUM_THREADS (e.g., in bash: 'export OMP_NUM_THREADS=32' or 'export OPENBLAS_NUM_THREADS=32')\n\n``max_fit_cores``\n~\n\n.. dropdown:: Maximum Number of Cores to Use for Model Fit\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's fit call." }, { "output": " This value defaults to 10. .. _use_dask_cluster:\n\n``use_dask_cluster``\n\n\n.. dropdown:: If full dask cluster is enabled, use full cluster\n\t:open:\n\n\tSpecify whether to use the full multinode distributed cluster (True) or single-node dask (False)." }, { "output": " E.g., several DGX nodes can be more efficient if using one DGX at a time for medium-sized data. The equivalent config.toml parameter is ``use_dask_cluster``." }, { "output": " Note that if you specify 0, all available cores will be used. This value defaults to 0 (all). ``max_predict_cores_in_dai``\n\n\n.. dropdown:: Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, AutoDoc\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's transform and predict call when doing operations in the Driverless AI MLI GUI and the Driverless AI R and Python clients." }, { "output": " This value defaults to 4. ``batch_cpu_tuning_max_workers``\n\n\n.. dropdown:: Tuning Workers per Batch for CPU\n\t:open:\n\n\tSpecify the number of workers used in CPU mode for tuning." }, { "output": " This value defaults to 0 (socket count). ``cpu_max_workers``\n~\n.. dropdown:: Number of Workers for CPU Training\n\t:open:\n\n\tSpecify the number of workers used in CPU mode for training:\n\n\t- 0: Use socket count (Default)\n\t- -1: Use all physical cores >= 1 that count\n\n.. _num_gpus_per_experiment:\n\n``num_gpus_per_experiment``\n~\n\n.. dropdown:: #GPUs/Experiment\n\t:open:\n\n\tSpecify the number of GPUs to use per experiment." }, { "output": " Must be at least as large as the number of GPUs to use per model (or -1). In multinode context when using dask, this refers to the per-node value." }, { "output": " In order to have a sufficient number of cores per GPU, this setting limits the number of GPUs used." }, { "output": " .. _num-gpus-per-model:\n\n``num_gpus_per_model``\n\n.. dropdown:: #GPUs/Model\n\t:open:\n\n\tSpecify the number of GPUs to use per model." }, { "output": " Currently, num_gpus_per_model other than 1 disables GPU locking, so it is only recommended for single experiments and single users." }, { "output": " In all cases, XGBoost tree and linear models use the number of GPUs specified per model, while LightGBM and Tensorflow revert to using 1 GPU/model and run multiple models on multiple GPUs." }, { "output": " Rulefit uses GPUs for parts involving obtaining the tree using LightGBM. In multinode context when using dask, this parameter refers to the per-node value." }, { "output": " ``num_gpus_for_prediction``\n~\n\n.. dropdown:: Num. of GPUs for Isolated Prediction/Transform\n\t:open:\n\n\tSpecify the number of GPUs to use for ``predict`` for models and ``transform`` for transformers when running outside of ``fit``/``fit_transform``." }, { "output": " New processes will use this count for applicable models and transformers. Note that enabling ``tensorflow_nlp_have_gpus_in_production`` will override this setting for relevant TensorFlow NLP transformers." }, { "output": " Note: When GPUs are used, TensorFlow, PyTorch models and transformers, and RAPIDS always predict on GPU." }, { "output": " In multinode context when using dask, this refers to the per-node value."
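}, { "output": " The GPU-numbering behavior described for the next setting can be previewed with a few lines of Python (the masking value is just an example):\n\n .. code-block:: python\n\n import os\n\n # gpu_id_start indexes into the CUDA_VISIBLE_DEVICES list,\n # not the physical device numbering.\n os.environ['CUDA_VISIBLE_DEVICES'] = '4,5'\n visible = os.environ['CUDA_VISIBLE_DEVICES'].split(',')\n gpu_id_start = 0\n print('physical device:', visible[gpu_id_start]) # -> 4"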
}, { "output": " In multinode context when using dask, this refers to the per-node value. ``gpu_id_start``\n\n\n.. dropdown:: GPU Starting ID\n\t:open:\n\n\tSpecify Which gpu_id to start with." }, { "output": " For example, if ``CUDA_VISIBLE_DEVICES='4,5'`` then ``gpu_id_start=0`` will refer to device #4. From expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs, then:\n\n\t- Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0\n\t- Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1\n\n\tFrom expert mode, to run 2 experiments, each on a distinct GPU out of 8 GPUs, then:\n\n\t- Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0\n\t- Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4\n\n\tTo run on all 4 GPUs/model, then\n\n\t- Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0\n\t- Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4\n\n\tIf num_gpus_per_model!=1, global GPU locking is disabled." }, { "output": " More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation\n\tNote that gpu selection does not wrap, so gpu_id_start + num_gpus_per_model must be less than the number of visibile GPUs." }, { "output": " For actual use beyond this value, system will start to have slow-down issues. THe default value is 3." }, { "output": " ``max_dt_threads_munging``\n\n\n.. dropdown:: Max Number of Threads to Use for datatable and OpenBLAS for Munging and Model Training\n\t:open:\n\n\tSpecify the maximum number of threads to use for datatable and OpenBLAS during data munging (applied on a per process basis):\n\n\t- 0 = Use all threads\n\t- -1 = Automatically select number of threads (Default)\n\n``max_dt_threads_readwrite``\n\n\n.. dropdown:: Max Number of Threads to Use for datatable Read and Write of Files\n\t:open:\n\n\tSpecify the maximum number of threads to use for datatable during data reading and writing (applied on a per process basis):\n\n\t- 0 = Use all threads\n\t- -1 = Automatically select number of threads (Default)\n\n``max_dt_threads_stats_openblas``\n~\n\n.. dropdown:: Max Number of Threads to Use for datatable Stats and OpenBLAS\n\t:open:\n\n\tSpecify the maximum number of threads to use for datatable stats and OpenBLAS (applied on a per process basis):\n\n\t- 0 = Use all threads\n\t- -1 = Automatically select number of threads (Default)\n\n.. _allow_reduce_features_when_failure:\n\n``allow_reduce_features_when_failure``\n\n\n.. dropdown:: Whether to reduce features when model fails (GPU OOM Protection)\n\t:open:\n\n\tBig models (on big data or with lot of features) can run out of memory on GPUs." }, { "output": " Currently is applicable to all non-dask XGBoost models (i.e. GLMModel, XGBoostGBMModel, XGBoostDartModel, XGBoostRFModel),during normal fit or when using Optuna." }, { "output": " For example, If XGBoost runs out of GPU memory, this is detected, and (regardless of setting of skip_model_failures), we perform feature selection using XGBoost on subsets of features." }, { "output": " This splitting continues until no failure occurs. Then all sub-models are used to estimate variable importance by absolute information gain, in order to decide which features to include." }, { "output": " Note:\n\n\t- This option is set to 'auto' -> 'on' by default i.e whenever the conditions are favorable, it is set to 'on'." 
}, { "output": " Hence if user enables reproducibility for the experiment, 'auto' automatically sets this option to 'off'." }, { "output": " - Reduction is only done on features and not on rows for the feature selection step. Also see :ref:`reduce_repeats_when_failure ` and :ref:`fraction_anchor_reduce_features_when_failure `\n\n.. _reduce_repeats_when_failure:\n\n``reduce_repeats_when_failure``\n~\n\n.. dropdown:: Number of repeats for models used for feature selection during failure recovery\n\t:open:\n\n\tWith :ref:`allow_reduce_features_when_failure `, this controls how many repeats of sub-models are used for feature selection." }, { "output": " More repeats can lead to higher accuracy. The cost of this option is proportional to the repeat count." }, { "output": " .. _fraction_anchor_reduce_features_when_failure:\n\n``fraction_anchor_reduce_features_when_failure``\n\n\n.. dropdown:: Fraction of features treated as anchor for feature selection during failure recovery\n\t:open:\n\n\tWith :ref:`allow_reduce_features_when_failure `, this controls the fraction of features treated as an anchor that are fixed for all sub-models." }, { "output": " For tuning and evolution, the probability depends upon any prior importance (if present) from other individuals, while final model uses uniform probability for anchor features." }, { "output": " ``xgboost_reduce_on_errors_list``\n~\n\n.. dropdown:: Errors From XGBoost That Trigger Reduction of Features\n\t:open:\n\n\tError strings from XGBoost that are used to trigger re-fit on reduced sub-models." }, { "output": " ``lightgbm_reduce_on_errors_list``\n\n\n.. dropdown:: Errors From LightGBM That Trigger Reduction of Features\n\t:open:\n\n\tError strings from LightGBM that are used to trigger re-fit on reduced sub-models." }, { "output": " ``num_gpus_per_hyperopt_dask``\n\n\n.. dropdown:: GPUs / HyperOptDask\n\t:open:\n\n\tSpecify the number of GPUs to use per model hyperopt training task." }, { "output": " For example, when this is set to -1 and there are 4 GPUs available, all of them can be used for the training of a single model across a Dask cluster." }, { "output": " In multinode context, this refers to the per-node value. ``detailed_traces``\n~\n\n.. dropdown:: Enable Detailed Traces\n\t:open:\n\n\tSpecify whether to enable detailed tracing in Driverless AI trace when running an experiment." }, { "output": " ``debug_log``\n~\n\n.. dropdown:: Enable Debug Log Level\n\t:open:\n\n\tIf enabled, the log files will also include debug logs." }, { "output": " ``log_system_info_per_experiment``\n\n\n.. dropdown:: Enable Logging of System Information for Each Experiment\n\t:open:\n\n\tSpecify whether to include system information such as CPU, GPU, and disk space at the start of each experiment log." }, { "output": " The F0.5 score is the weighted harmonic mean of the precision and recall (given a threshold value)." }, { "output": " More weight should be given to precision for cases where False Positives are considered worse than False Negatives." }, { "output": " In this case, you want your predictions to be very precise and only capture the products that will definitely run out." }, { "output": " F05 equation:\n\n.. math::\n\n F0.5 = 1.25 \\;\\Big(\\; \\frac{(precision) \\; (recall)}{((0.25) \\; (precision)) + recall}\\; \\Big)\n\nWhere:\n\n- *precision* is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives)." 
}, { "output": " S3 Setup\n\n\nDriverless AI lets you explore S3 data sources from within the Driverless AI application." }, { "output": " Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``aws_access_key_id``: The S3 access key ID\n- ``aws_secret_access_key``: The S3 access key\n- ``aws_role_arn``: The Amazon Resource Name\n- ``aws_default_region``: The region to use when the aws_s3_endpoint_url option is not set." }, { "output": " - ``aws_s3_endpoint_url``: The endpoint URL that will be used to access S3. - ``aws_use_ec2_role_credentials``: If set to true, the S3 Connector will try to to obtain credentials associated with the role attached to the EC2 instance." }, { "output": " - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly." }, { "output": " It does not pass any S3 access key or secret; however it configures Docker DNS by passing the name and IP of the S3 name node." }, { "output": " .. code-block:: bash\n\t :substitutions:\n\n\t nvidia-docker run \\\n\t\t\tshm-size=256m \\\n\t\t\tadd-host name.node:172.16.2.186 \\\n\t\t\t-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,s3\" \\\n\t\t\t-p 12345:12345 \\\n\t\t\tinit -it rm \\\n\t\t\t-v /tmp/dtmp/:/tmp \\\n\t\t\t-v /tmp/dlog/:/log \\\n\t\t\t-v /tmp/dlicense/:/license \\\n\t\t\t-v /tmp/ddata/:/data \\\n\t\t\t-u $(id -u):$(id -g) \\\n\t\t\th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n\tThis example shows how to configure S3 options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, s3\"``\n\n\t2." }, { "output": " .. code-block:: bash\n\t \t :substitutions:\n\n\t\t nvidia-docker run \\\n\t\t \tpid=host \\\n\t\t \tinit \\\n\t\t \trm \\\n\t\t \tshm-size=256m \\\n\t\t \tadd-host name.node:172.16.2.186 \\\n\t\t \t-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n\t\t \t-p 12345:12345 \\\n\t\t \t-v /local/path/to/config.toml:/path/in/docker/config.toml \\\n\t\t \t-v /etc/passwd:/etc/passwd:ro \\\n\t\t \t-v /etc/group:/etc/group:ro \\\n\t\t \t-v /tmp/dtmp/:/tmp \\\n\t\t \t-v /tmp/dlog/:/log \\\n\t\t \t-v /tmp/dlicense/:/license \\\n\t\t \t-v /tmp/ddata/:/data \\\n\t\t \t-u $(id -u):$(id -g) \\\n\t\t \th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n\tThis example enables the S3 data connector and disables authentication." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n\t ::\n\n\t # DEB and RPM\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n\t # TAR SH\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n\t2." 
}, { "output": " ::\n\n\t\t# File System Support\n\t\t# upload : standard upload feature\n\t\t# file : local file system/server file system\n\t\t# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n\t\t# dtap : Blue Data Tap file system, remember to configure the DTap section below\n\t\t# s3 : Amazon S3, optionally configure secret and access key below\n\t\t# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n\t\t# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n\t\t# minio : Minio Cloud Storage, remember to configure secret and access key below\n\t\t# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n\t\t# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n\t\t# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n\t\t# jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n\t\t# recipe_url: load custom recipe from URL\n\t\t# recipe_file: load custom recipe from local file system\n\t\tenabled_file_systems = \"file, s3\"\n\n\t3." }, { "output": " Example 2: Enable S3 with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n\tThis example enables the S3 data connector with authentication by passing an S3 access key ID and an access key." }, { "output": " This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, s3\"``\n\t - ``aws_access_key_id = \"\"``\n\t - ``aws_secret_access_key = \"\"``\n\n\t2." }, { "output": " .. code-block:: bash\n\t \t:substitutions:\n\n\t\t nvidia-docker run \\\n\t\t \tpid=host \\\n\t\t \tinit \\\n\t\t \trm \\\n\t\t \tshm-size=256m \\\n\t\t \tadd-host name.node:172.16.2.186 \\\n\t\t \t-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n\t\t \t-p 12345:12345 \\\n\t\t \t-v /local/path/to/config.toml:/path/in/docker/config.toml \\\n\t\t \t-v /etc/passwd:/etc/passwd:ro \\\n\t\t \t-v /etc/group:/etc/group:ro \\\n\t\t \t-v /tmp/dtmp/:/tmp \\\n\t\t \t-v /tmp/dlog/:/log \\\n\t\t \t-v /tmp/dlicense/:/license \\\n\t\t \t-v /tmp/ddata/:/data \\\n\t\t \t-u $(id -u):$(id -g) \\\n\t\t \th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n\tThis example enables the S3 data connector with authentication by passing an S3 access key ID and an access key." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n\t ::\n\n\t # DEB and RPM\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n\t # TAR SH\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n\t2." 
}, { "output": " ::\n\n\t\t# File System Support\n\t\t# upload : standard upload feature\n\t\t# file : local file system/server file system\n\t\t# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n\t\t# dtap : Blue Data Tap file system, remember to configure the DTap section below\n\t\t# s3 : Amazon S3, optionally configure secret and access key below\n\t\t# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n\t\t# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n\t\t# minio : Minio Cloud Storage, remember to configure secret and access key below\n\t\t# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n\t\t# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n\t\t# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n\t\t# jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n\t\t# recipe_url: load custom recipe from URL\n\t\t# recipe_file: load custom recipe from local file system\n\t\tenabled_file_systems = \"file, s3\"\n\n\t\t# S3 Connector credentials\n\t\taws_access_key_id = \"\"\n\t\taws_secret_access_key = \"\"\n\n\t3." }, { "output": " .. _image-settings:\n\nImage Settings\n\n\n``enable_tensorflow_image``\n~\n.. dropdown:: Enable Image Transformer for Processing of Image Data\n\t:open:\n\n\tSpecify whether to use pretrained deep learning models for processing of image data as part of the feature engineering pipeline." }, { "output": " This is enabled by default. .. _tensorflow_image_pretrained_models:\n\n``tensorflow_image_pretrained_models``\n\n\n.. dropdown:: Supported ImageNet Pretrained Architectures for Image Transformer\n\t:open:\n\n\tSpecify the supported `ImageNet `__ pretrained architectures for image transformer." }, { "output": " If an internet connection is not available, non-default models must be downloaded from http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/dai_image_models_1_10.zip and extracted into ``tensorflow_image_pretrained_models_dir``." }, { "output": " In this case, embeddings from the different architectures are concatenated together (in a single embedding)." }, { "output": " Select from the following:\n\n\t- 10\n\t- 25\n\t- 50\n\t- 100 (Default)\n\t- 200\n\t- 300\n\n\tNote: Multiple transformers can be activated at the same time to allow the selection of multiple options." }, { "output": " This is disabled by default. ``tensorflow_image_fine_tuning_num_epochs``\n~\n.. dropdown:: Number of Epochs for Fine-Tuning Used for the Image Transformer\n\t:open:\n\n\tSpecify the number of epochs for fine-tuning ImageNet pretrained models used for the Image Transformer." }, { "output": " ``tensorflow_image_augmentations``\n\n.. dropdown:: List of Augmentations for Fine-Tuning Used for the Image Transformer\n\t:open:\n\n\tSpecify the list of possible image augmentations to apply while fine-tuning the ImageNet pretrained models used for the Image Transformer." }, { "output": " ``tensorflow_image_batch_size``\n~\n.. dropdown:: Batch Size for the Image Transformer\n\t:open:\n\n\tSpecify the batch size for the Image Transformer." }, { "output": " Note: Larger architectures and batch sizes use more memory. ``image_download_timeout``\n\n.. 
dropdown:: Image Download Timeout in Seconds\n\t:open:\n\n\tWhen providing images through URLs, specify the maximum number of seconds to wait for an image to download." }, { "output": " ``string_col_as_image_max_missing_fraction``\n\n.. dropdown:: Maximum Allowed Fraction of Missing Values for Image Column\n\t:open:\n\n\tSpecify the maximum allowed fraction of missing elements in a string column for it to be considered as a potential image path." }, { "output": " ``string_col_as_image_min_valid_types_fraction``\n\n.. dropdown:: Minimum Fraction of Images That Need to Be of Valid Types for Image Column to Be Used\n\t:open:\n\n\tSpecify the fraction of unique image URIs that need to have valid endings (as defined by ``string_col_as_image_valid_types``) for a string column to be considered as image data." }, { "output": " ``tensorflow_image_use_gpu``\n\n.. dropdown:: Enable GPU(s) for Faster Transformations With the Image Transformer\n\t:open:\n\n\tSpecify whether to use any available GPUs to transform images into embeddings with the Image Transformer." }, { "output": " Install on RHEL\n-\n\nThis section describes how to install the Driverless AI Docker image on RHEL. The installation steps vary depending on whether your system has GPUs or if it is CPU only." }, { "output": " | Min Mem |\n+=+=+=+\n| RHEL with GPUs | Yes | 64 GB |\n+-+-+-+\n| RHEL with CPUs | No | 64 GB |\n+-+-+-+\n\n.. _install-on-rhel-with-gpus:\n\nInstall on RHEL with GPUs\n~\n\nNote: Refer to the following links for more information about using RHEL with GPUs." }, { "output": " This is necessary in order to prevent a mismatch between the NVIDIA driver and the kernel, which can lead to the GPUs failures." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same." }, { "output": " Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following steps." }, { "output": " Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Install and start Docker EE on RHEL (if not already installed)." }, { "output": " Alternatively, you can run on Docker CE. .. code-block:: bash\n\n sudo yum install -y yum-utils\n sudo yum-config-manager add-repo https://download.docker.com/linux/centos/docker-ce.repo\n sudo yum makecache fast\n sudo yum -y install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md. .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(." }, { "output": " If you do not run this command, you will have to remember to start the nvidia-docker service manually; otherwise the GPUs will not appear as available." }, { "output": " Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver." }, { "output": " Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n \n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n6." }, { "output": " Enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." 
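}, { "output": " As a sketch of one common way to do this (the persistence daemon described at the link above is NVIDIA's preferred long-term mechanism), persistence mode can be enabled with nvidia-smi after each reboot:\n\n .. code-block:: bash\n\n # Enable persistence mode on all GPUs (requires root; lasts until reboot)\n sudo nvidia-smi -pm 1"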
}, { "output": " Set up the data, log, and license directories on the host machine (within the new directory):\n\n .. code-block:: bash\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n9." }, { "output": " The data will be visible inside the Docker container. 10. Run ``docker images`` to find the image tag." }, { "output": " Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " For GPU users, as GPU needs ``pid=host`` for nvml, which makes tini not use pid=1, so it will show the warning message (still harmless)." }, { "output": " But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n12." }, { "output": " .. _install-on-rhel-cpus-only:\n\nInstall on RHEL with CPUs\n~\n\nThis section describes how to install and start the Driverless AI Docker image on RHEL." }, { "output": " Watch the installation video `here `__." }, { "output": " .. note::\n\tAs of this writing, Driverless AI has been tested on RHEL versions 7.4, 8.3, and 8.4. Open a Terminal and ssh to the machine that will run Driverless AI." }, { "output": " 1. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/." }, { "output": " .. code-block:: bash\n\n sudo yum install -y yum-utils\n sudo yum-config-manager add-repo https://download.docker.com/linux/centos/docker-ce.repo\n sudo yum makecache fast\n sudo yum -y install docker-ce\n sudo systemctl start docker\n\n2." }, { "output": " 3. Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n4." }, { "output": " Set up the data, log, license, and tmp directories (within the new directory):\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the directory associated with your version of Driverless AI\n cd |VERSION-dir|\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n6." }, { "output": " The data will be visible inside the Docker container at //data. 7. 
Run ``docker images`` to find the image tag." }, { "output": " Start the Driverless AI Docker image. Note that GPU support will not be available. Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``init`` from Docker; if both are enabled in the launch command, tini will print a (harmless) warning message." }, { "output": " If the user plans to build the :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for the Driverless AI Docker command." }, { "output": " HDFS Setup\n\n\nDriverless AI lets you explore HDFS data sources from within the Driverless AI application." }, { "output": " Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``hdfs_config_path`` (Required): The location of the HDFS config folder path." }, { "output": " - ``hdfs_auth_type`` (Required): Specifies the HDFS authentication. Available values are:\n\n - ``principal``: Authenticate with HDFS with a principal user." }, { "output": " If running DAI as a service, then the Kerberos keytab needs to be owned by the DAI user. - ``keytabimpersonation``: Login with impersonation using a keytab." }, { "output": " - ``key_tab_path``: The path of the principal key tab file. This is required when ``hdfs_auth_type='principal'``." }, { "output": " This is required when ``hdfs_auth_type='keytab'``. - ``hdfs_app_jvm_args``: JVM args for HDFS distributions." }, { "output": " - ``-Djava.security.krb5.conf``\n - ``-Dsun.security.krb5.debug``\n - ``-Dlog4j.configuration``\n\n- ``hdfs_app_classpath``: The HDFS classpath." }, { "output": " For example:\n\n ::\n\n hdfs_app_supported_schemes = ['hdfs://', 'maprfs://', 'custom://']\n\n The following are the default values for this option." }, { "output": " - ``hdfs://``\n - ``maprfs://``\n - ``swift://``\n\n- ``hdfs_max_files_listed``: Specifies the maximum number of files that are viewable in the connector UI." }, { "output": " To view more files, increase the default value. - ``hdfs_init_path``: Specifies the starting HDFS path displayed in the UI of the HDFS browser." }, { "output": " This must be configured in order for data connectors to function properly. Example 1: Enable HDFS with No Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the HDFS data connector and disables HDFS authentication." }, { "output": " This lets you reference data stored in HDFS directly using the name node address, for example: ``hdfs://name.node/datasets/iris.csv``." }, { "output": " Note that this example enables HDFS with no authentication. 1. Configure the Driverless AI config.toml file." }, { "output": " Note that the procsy port, which defaults to 12347, also has to be changed. - ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n\n 2." }, { "output": " Mount the config.toml file into the Docker container. ..
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the HDFS data connector and disables HDFS authentication in the config.toml file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " Note that the procsy port, which defaults to 12347, also has to be changed. ::\n\n # IP address and port of procsy process." }, { "output": " (jdbc_app_configs)\n # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n 3." }, { "output": " Example 2: Enable HDFS with Keytab-Based Authentication\n~\n\nNotes: \n\n- If using Kerberos Authentication, then the time on the Driverless AI server must be in sync with Kerberos server." }, { "output": " - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user; otherwise Driverless AI will not be able to read/access the Keytab and will result in a fallback to simple authentication and, hence, fail." }, { "output": " - Configures the environment variable ``DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER`` to reference a user for whom the keytab was created (usually in the form of user@realm)." }, { "output": " - Configures the option ``hdfs_app_prinicpal_user`` to reference a user for whom the keytab was created (usually in the form of user@realm)." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed." }, { "output": " Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n \n # IP address and port of procsy process. 
procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n # HDFS connector\n # Auth type can be Principal/keytab/keytabPrincipal\n # Specify HDFS Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with HDFS with a principal user\n # keytab : Authenticate with a Key tab (recommended)\n # keytabimpersonation : Login with impersonation using a keytab\n hdfs_auth_type = \"keytab\"\n\n # Path of the principal key tab file\n key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n hdfs_app_principal_user = \"\"\n\n 3." }, { "output": " Example 3: Enable HDFS with Keytab-Based Impersonation\n\n\nNotes: \n\n- If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server." }, { "output": " - Logins are case sensitive when keytab-based impersonation is configured. .. tabs::\n .. group-tab:: Docker Image Installs\n\n The example:\n\n - Sets the authentication type to ``keytabimpersonation``." }, { "output": " - Configures the ``DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER`` variable, which references a user for whom the keytab was created (usually in the form of user@realm)." }, { "output": " - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed." }, { "output": " Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Sets the authentication type to ``keytabimpersonation``." 
}, { "output": " - Configures the ``hdfs_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm)." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # IP address and port of procsy process. procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n # HDFS connector\n # Auth type can be Principal/keytab/keytabPrincipal\n # Specify HDFS Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with HDFS with a principal user\n # keytab : Authenticate with a Key tab (recommended)\n # keytabimpersonation : Login with impersonation using a keytab\n hdfs_auth_type = \"keytabimpersonation\"\n\n # Path of the principal key tab file\n key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n hdfs_app_principal_user = \"\"\n\n 3." }, { "output": " Specifying a Hadoop Platform\n\n\nThe following example shows how to build an H2O-3 Hadoop image and run Driverless AI." }, { "output": " Change the ``H2O_TARGET`` to specify a different platform. 1. Clone and then build H2O-3 for CDH 6.0." }, { "output": " Start H2O. .. code-block:: bash\n\n docker run -it rm \\\n -v `pwd`:`pwd` \\\n -w `pwd` \\\n entrypoint bash \\\n network=host \\\n -p 8020:8020 \\\n docker.h2o.ai/cdh-6-w-hive \\\n -c 'sudo -E startup.sh && \\\n source /envs/h2o_env_python3.8/bin/activate && \\\n hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars \"$(cat /opt/hive-jars/hive-libjars)\" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \\\n export CLOUD_IP=localhost && \\\n export CLOUD_PORT=54445 && \\\n make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \\\n bash'\n\n3." }, { "output": " .. code-block:: bash\n\n java -cp connectors/hdfs.jar ai.h2o.dai.connectors.HdfsConnector\n\n\n4. Verify the commands for ``ls`` and ``cp``, for example." }, { "output": " .. 
_running-docker-on-gce:\n\nInstall and Run in a Docker Container on Google Compute Engine\n\n\nThis section describes how to install and start Driverless AI from scratch using a Docker container in a Google Compute environment." }, { "output": " If you don't have an account, go to https://console.cloud.google.com/getting-started to create one." }, { "output": " Watch the installation video `here `__." }, { "output": " Before You Begin\n\n\nIf you are trying GCP for the first time and have just created an account, check your Google Compute Engine (GCE) resource quota limits." }, { "output": " You can change these settings to match your quota limit, or you can request more resources from GCP." }, { "output": " Installation Procedure\n\n\n1. In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/." }, { "output": " In the left navigation panel, select Compute Engine > VM Instances. .. image:: ../images/gce_newvm_instance.png\n :align: center\n :height: 390\n :width: 400\n\n3." }, { "output": " .. image:: ../images/gce_create_instance.png\n :align: center\n\n4. Specify the following at a minimum:\n\n - A unique name for this instance." }, { "output": " Note that not all zones and user accounts can select zones with GPU instances. Refer to the following for information on how to add GPUs: https://cloud.google.com/compute/docs/gpus/." }, { "output": " Be sure to also increase the disk size of the OS image to be 64 GB. Click Create at the bottom of the form when you are done." }, { "output": " .. image:: ../images/gce_instance_settings.png\n :align: center\n :height: 446\n :width: 380\n\n5." }, { "output": " On the Google Cloud Platform left navigation panel, select VPC network > Firewall rules. Specify the following settings:\n\n - Specify a unique name and Description for this instance." }, { "output": " - Specify the Source IP ranges to be ``0.0.0.0/0``. - Under Protocols and Ports, select Specified protocols and ports and enter the following: ``tcp:12345``." }, { "output": " .. image:: ../images/gce_create_firewall_rule.png\n :align: center\n :height: 452\n :width: 477\n\n6." }, { "output": " .. image:: ../images/gce_ssh_in_browser.png\n :align: center\n\n7. H2O provides a script for you to run in your VM instance." }, { "output": " Copy one of the scripts below (depending on whether you are running GPUs or CPUs). Save the script as install.sh." }, { "output": " /etc/os-release;echo $ID$VERSION_ID)\n curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \\\n sudo tee /etc/apt/sources.list.d/nvidia-docker.list\n sudo apt-get update\n\n # Install nvidia-docker2 and reload the Docker daemon configuration\n sudo apt-get install -y nvidia-docker2\n\n .. code-block:: bash\n\n # SCRIPT FOR CPUs ONLY\n apt-get -y update \n apt-get -y no-install-recommends install \\\n curl \\\n apt-utils \\\n python-software-properties \\\n software-properties-common\n\n add-apt-repository -y \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - \n\n apt-get update \n apt-get install -y docker-ce\n\n\n8." }, { "output": " .. code-block:: bash\n\n chmod +x install.sh\n sudo ./install.sh\n\n9. In your user folder, create the following directories as your user." }, { "output": " Add your Google Compute user name to the Docker container. .. code-block:: bash\n\n sudo usermod -aG docker \n\n\n11." }, { "output": " .. 
code-block:: bash\n\n sudo reboot\n\n12. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/." }, { "output": " Load the Driverless AI Docker image. The following example shows how to load Driverless AI. Replace VERSION with your image." }, { "output": " If you are running CPUs, you can skip this step. Otherwise, you must enable persistence of the GPU." }, { "output": " Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " Note: Use ``docker version`` to check which version of Docker you are using. .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n16." }, { "output": " You can stop the instance using one of the following methods: \n\nStopping in the browser\n\n1. On the VM Instances page, click on the VM instance that you want to stop." }, { "output": " Click Stop at the top of the page. 3. A confirmation page will display. Click Stop to stop the instance." }, { "output": " Azure Blob Store Setup\n \n\nDriverless AI lets you explore Azure Blob Store data sources from within the Driverless AI application." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Supported Data Sources Using the Azure Blob Store Connector\n~\n\nThe following data sources can be used with the Azure Blob Store connector." }, { "output": " - :ref:`Azure Data Lake Gen 1 (HDFS connector required)`\n- :ref:`Azure Data Lake Gen 2 (HDFS connector optional)`\n\n\nDescription of Configuration Attributes\n~\n\nThe following configuration attributes are specific to enabling Azure Blob Storage." }, { "output": " This should be the dns prefix created when the account was created (for example, \"mystorage\"). - ``azure_blob_account_key``: Specify the account key that maps to your account name." }, { "output": " With this option, you can include an override for a host, port, and/or account name. For example, \n\n .. code:: bash\n\n azure_connection_string = \"DefaultEndpointsProtocol=http;AccountName=;AccountKey=;BlobEndpoint=http://:/;\"\n\n- ``azure_blob_init_path``: Specifies the starting Azure Blob store path displayed in the UI of the Azure Blob store browser." }, { "output": " This must be configured in order for data connectors to function properly. 
The following additional configuration attributes can be used for enabling an HDFS Connector to connect to Azure Data Lake Gen 1 (and optionally with Azure Data Lake Gen 2)." }, { "output": " This folder can contain multiple config files. - ``hdfs_app_classpath``: The HDFS classpath. - ``hdfs_app_supported_schemes``: Supported schemas list is used as an initial check to ensure valid input to connector." }, { "output": " This lets users reference data stored on your Azure storage account using the account name, for example: ``https://mystorage.blob.core.windows.net``." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, azrbs\"``\n - ``azure_blob_account_name = \"mystorage\"``\n - ``azure_blob_account_key = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example shows how to enable the Azure Blob Store data connector in the config.toml file when starting Driverless AI in native installs." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, azrbs\"\n\n # Azure Blob Store Connector credentials\n azure_blob_account_name = \"mystorage\"\n azure_blob_account_key = \"\"\n\n 3." }, { "output": " .. _example2:\n\nExample 2: Mount Azure File Shares to the Local File System\n~\n\nSupported Data Sources Using the Local File System\n\n\n- Azure Files (File Storage) \n\nMounting Azure File Shares\n\n\nAzure file shares can be mounted into the Local File system of Driverless AI." 
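}, { "output": " For example, on most Linux hosts an Azure file share can be mounted over SMB/CIFS along these lines (a sketch only; the storage account name, share name, and mount point are placeholders, and the cifs-utils package must be installed):\n\n .. code-block:: bash\n\n sudo mkdir -p /mnt/azurefiles\n # Mount the share; the username is the storage account name and the password is an account key\n sudo mount -t cifs //mystorage.file.core.windows.net/myshare /mnt/azurefiles \\\n -o vers=3.0,username=mystorage,password=<storage-account-key>,serverino\n\n Once mounted, the share contents are available to Driverless AI like any other local directory."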
}, { "output": " .. _example3:\n\nExample 3: Enable HDFS Connector to Connect to Azure Data Lake Gen 1\n~\n\nThis example enables the HDFS Connector to connect to Azure Data Lake Gen1." }, { "output": " .. tabs::\n .. group-tab:: Docker Image with the config.toml\n\n 1. Create an Azure AD web application for service-to-service authentication: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n\n 2." }, { "output": " Take note of the Hadoop Classpath and add the ``azure-datalake-store.jar`` file. This file can found on any Hadoop version in: ``$HADOOP_HOME/share/hadoop/tools/lib/*``." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['adl://']\"\n \n 5." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n\n 2." }, { "output": " Take note of the Hadoop Classpath and add the ``azure-datalake-store.jar`` file. This file can found on any hadoop version in: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n\n 4." }, { "output": " Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['adl://']\"\n \n 5." }, { "output": " .. _example4:\n\nExample 4: Enable HDFS Connector to Connect to Azure Data Lake Gen 2\n\n\nThis example enables the HDFS Connector to connect to Azure Data Lake Gen2." }, { "output": " .. tabs::\n .. group-tab:: Docker Image with the config.toml\n\n 1. Create an Azure Service Principal: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal\n\n 2." }, { "output": " Add the information from your web application to the Hadoop ``core-site.xml`` configuration file:\n\n .. code:: bash\n\n \n \n fs.azure.account.auth.type\n OAuth\n \n \n fs.azure.account.oauth.provider.type\n org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\n \n \n fs.azure.account.oauth2.client.endpoint\n Token endpoint created in step 1.\n \n \n fs.azure.account.oauth2.client.id\n Client ID created in step 1\n \n \n fs.azure.account.oauth2.client.secret\n Client Secret created in step 1\n \n \n\n 4." }, { "output": " These files can found on any Hadoop version 3.2 or higher at: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. 
code:: bash \n\n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n \n Note: ABFS is only supported for Hadoop version 3.2 or higher." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['abfs://']\"\n \n 6." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal\n\n 2." }, { "output": " Add the information from your web application to the hadoop ``core-site.xml`` configuration file:\n\n .. code:: bash\n\n \n \n fs.azure.account.auth.type\n OAuth\n \n \n fs.azure.account.oauth.provider.type\n org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\n \n \n fs.azure.account.oauth2.client.endpoint\n Token endpoint created in step 1.\n \n \n fs.azure.account.oauth2.client.id\n Client ID created in step 1\n \n \n fs.azure.account.oauth2.client.secret\n Client Secret created in step 1\n \n \n\n 4." }, { "output": " These files can found on any hadoop version 3.2 or higher at: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n \n Note: ABFS is only supported for hadoop version 3.2 or higher \n\n 5." }, { "output": " Set the following configuration options: \n\n .. code:: bash\n \n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['abfs://']\"\n \n 6." }, { "output": " Export MOJO artifact to Azure Blob Storage\n\n\nIn order to export the MOJO artifact to Azure Blob Storage, you must enable support for the shared access signatures (SAS) token." }, { "output": " ``enable_artifacts_upload=true``\n2. ``artifacts_store=\"azure\"``\n3. ``artifacts_azure_sas_token=\"token\"``\n\nFor instructions on exporting artifacts, see :ref:`export_artifacts`." }, { "output": " Yes. Driverless AI can use private endpoints if Driverless AI is located in the allowed VNET. Does Driverless AI support secure transfer?" }, { "output": " The Azure Blob Store Connector make all connections over HTTPS. Does Driverless AI support hierarchical namespaces?" }, { "output": " Can I use Azure Managed Identities (MSI) to access the DataLake? Yes. If Driverless AI is running on an Azure VM with managed identities." }, { "output": " .. _recipes-settings:\n\nRecipes Settings\n\n\n.. _included_transformers:\n\n``included_transformers``\n\n\n.. dropdown:: Include Specific Transformers\n\t:open:\n\n\tSelect the :ref:`transformer(s) ` that you want to use in the experiment." 
}, { "output": " Note: If you uncheck all transformers so that none is selected, Driverless AI will ignore this and will use the default list of transformers for that experiment." }, { "output": " The equivalent config.toml parameter is ``included_transformers``. .. _included_models:\n\n``included_models``\n~\n\n.. dropdown:: Include Specific Models\n\t:open:\n\n\tSpecify the types of models that you want Driverless AI to build in the experiment." }, { "output": " Note: The ImbalancedLightGBM and ImbalancedXGBoostGBM models are closely tied with the :ref:`sampling_method_for_imbalanced` option." }, { "output": " If the target fraction proves to be above the allowed imbalance threshold, then sampling will be triggered." }, { "output": " - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are ENABLED and the :ref:`sampling_method_for_imbalanced` is DISABLED, sampling will not be used, and these imbalanced models will be disabled." }, { "output": " .. _included_pretransformers:\n\n``included_pretransformers``\n\n\n.. dropdown:: Include Specific Preprocessing Transformers\n\t:open:\n\n\tSpecify which :ref:`transformers ` to use for preprocessing before other transformers are activated." }, { "output": " Notes:\n\n\t- Preprocessing transformers and all other layers of transformers are part of the Python and (if applicable) MOJO scoring packages." }, { "output": " For example, a preprocessing transformer can perform interactions, string concatenations, or date extractions as a preprocessing step before the next layer of Date and DateTime transformations are performed." }, { "output": " However, one can use a run-time data recipe to (e.g.) convert a float date-time into string date-time, and this will be used by Driverless AIs Date and DateTime transformers as well as auto-detection of time series." }, { "output": " the dataset\n\t must have time column and groups prepared ahead of experiment by user or via a one-time :ref:`data recipe `." }, { "output": " .. _num_pipeline_layers:\n\n``num_pipeline_layers``\n~\n\n.. dropdown:: Number of Pipeline Layers\n\t:open:\n\n\tSpecify the number of pipeline layers." }, { "output": " The equivalent config.toml parameter is ``num_pipeline_layers``. Note: This does not include the preprocessing layer specified by the :ref:`included_pretransformers` expert setting." }, { "output": " Avoids need for separate data preparation step, builds data preparation within experiment and within python scoring package." }, { "output": " The equivalent config.toml parameter is ``included_datas``. .. _included_individuals:\n\n``included_individuals``\n\n\n.. dropdown:: Include Specific Individuals\n\t:open:\n\n\tIn Driverless AI, every completed experiment automatically generates Python code for the experiment that corresponds to the individual(s) used to build the final model." }, { "output": " This feature gives you code-first access to a significant portion of DAI's internal transformer and model generation process." }, { "output": " - Select recipe display names of custom individuals through the UI. If the number of included custom individuals is less than DAI needs, then the remaining individuals are freshly generated." }, { "output": " For more information, see :ref:`individual_recipe`. ``threshold_scorer``\n\n\n.. 
dropdown:: Scorer to Optimize Threshold to Be Used in Other Confusion-Matrix Based Scorers (For Binary Classification)\n\t:open:\n\n\tSpecify the scorer used to optimize the binary probability threshold that is being used in related Confusion Matrix based scorers such as Precision, Recall, FalsePositiveRate, FalseDiscoveryRate, FalseOmissionRate, TrueNegativeRate, FalseNegativeRate, and NegativePredictiveValue." }, { "output": " If this is not possible, F1 is used. - F05 More weight on precision, less weight on recall. - F1: Equal weight on precision and recall." }, { "output": " - MCC: Use this option when all classes are equally important. ``prob_add_genes``\n\n\n.. dropdown:: Probability to Add Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to add genes or instances of transformers with specific attributes." }, { "output": " This value defaults to 0.5. ``prob_addbest_genes``\n\n\n.. dropdown:: Probability to Add Best Shared Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to add genes or instances of transformers with specific attributes that have shown to be beneficial to other individuals within the population." }, { "output": " ``prob_prune_genes``\n\n\n.. dropdown:: Probability to Prune Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to prune genes or instances of transformers with specific attributes." }, { "output": " ``prob_perturb_xgb``\n\n\n.. dropdown:: Probability to Mutate Model Parameters\n\t:open:\n\n\tSpecify the unnormalized probability to change model hyper parameters." }, { "output": " ``prob_prune_by_features``\n\n\n.. dropdown:: Probability to Prune Weak Features\n\t:open:\n\n\tSpecify the unnormalized probability to prune features that have low variable importance instead of pruning entire instances of genes/transformers." }, { "output": " ``skip_transformer_failures``\n~\n\n.. dropdown:: Whether to Skip Failures of Transformers\n\t:open:\n\n\tSpecify whether to avoid failed transformers." }, { "output": " ``skip_model_failures``\n~\n\n.. dropdown:: Whether to Skip Failures of Models\n\t:open:\n\n\tSpecify whether to avoid failed models." }, { "output": " This is enabled by default. ``detailed_skip_failure_messages_level``\n\n\n.. dropdown:: Level to Log for Skipped Failures\n\t:open:\n\n\tSpecify one of the following levels for the verbosity of log failure messages for skipped transformers or models:\n\n\t- 0 = Log simple message\n\t- 1 = Log code line plus message (Default)\n\t- 2 = Log detailed stack traces\n\n``notify_failures``\n~\n\n.. dropdown:: Whether to Notify About Failures of Transformers or Models or Other Recipe Failures\n\t:open:\n\n\tSpecify whether to display notifications in the GUI about recipe failures." }, { "output": " The equivalent config.toml parameter is ``notify_failures``. ``acceptance_test_timeout``\n~\n\n.. dropdown:: Timeout in Minutes for Testing Acceptance of Each Recipe\n\t:open:\n\n\tSpecify the number of minutes to wait until a recipe's acceptance testing is aborted." }, { "output": " .. _install-gcp-offering:\n\nInstall the Google Cloud Platform Offering\n\n\nThis section describes how to install and start Driverless AI in a Google Compute environment using the GCP Marketplace." }, { "output": " If you don't have an account, go to https://console.cloud.google.com/getting-started to create one." }, { "output": " By default, GCP allocates a maximum of 8 CPUs and no GPUs. Our default recommendation for launching Driverless AI is 32 CPUs, 120 GB RAM, and 2 P100 NVIDIA GPUs." 
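}, { "output": " Before launching, you can check what your project currently allows from the command line. A quick sketch, assuming the gcloud CLI is installed and authenticated (the region shown is just an example):\n\n .. code-block:: bash\n\n # Show per-region quotas (CPUS, NVIDIA_P100_GPUS, etc.) for us-central1\n gcloud compute regions describe us-central1 --format=\"value(quotas)\""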
}, { "output": " Refer to https://cloud.google.com/compute/quotas for more information, including information on how to check your quota and request additional quota." }, { "output": " In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/. 2." }, { "output": " .. image:: ../images/google_cloud_launcher.png\n :align: center\n :height: 266\n :width: 355\n\n3." }, { "output": " The following page will display. .. image:: ../images/google_driverlessai_offering.png\n :align: center\n\n4." }, { "output": " (If necessary, refer to `Google Compute Instance Types `__ for information about machine and GPU types.)" }, { "output": " (This defaults to 32 CPUs and 120 GB RAM.) - Specify a GPU type. (This defaults to a p100 GPU.) - Optionally change the number of GPUs." }, { "output": " - Specify the boot disk type and size. - Optionally change the network name and subnetwork names. Be sure that whichever network you specify has port 12345 exposed." }, { "output": " Driverless AI will begin deploying. Note that this can take several minutes. .. image:: ../images/google_deploy_compute_engine.png\n :align: center\n\n5." }, { "output": " This page includes the instance ID and the username (always h2oai) and password that will be required when starting Driverless AI." }, { "output": " .. image:: ../images/google_deploy_summary.png\n :align: center\n\n6. In your browser, go to https://[External_IP]:12345 to start Driverless AI." }, { "output": " Agree to the Terms and Conditions. 8. Log in to Driverless AI using your user name and password. 9." }, { "output": " a. In order to enable GCS and Google BigQuery access, you must pass the running instance a service account json file configured with GCS and GBQ access." }, { "output": " Obtain a functioning service account json file from `GCP `__, rename it to \"service_account.json\", and copy it to the Ubuntu user on the running instance." }, { "output": " c. Restart the machine for the changes to take effect. .. code-block:: bash\n\n sudo systemctl stop dai\n\n # Wait for the system to stop\n\n # Verify that the system is no longer running\n sudo systemctl status dai\n\n # Restart the system\n sudo systemctl start dai\n\nUpgrading the Google Cloud Platform Offering\n\n\nPerform the following steps to upgrade the Driverless AI Google Platform offering." }, { "output": " Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf." }, { "output": " .. code-block:: bash\n\n # Stop Driverless AI. sudo systemctl stop dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " .. _time-series-settings:\n\nTime Series Settings\n\n\n.. _time-series-lag-based-recipe:\n\n``time_series_recipe``\n\n.. dropdown:: Time-Series Lag-Based Recipe\n\t:open:\n\n\tThis recipe specifies whether to include Time Series lag features when training a model with a provided (or autodetected) time column." }, { "output": " Lag features are the primary automatically generated time series features and represent a variable's past values." }, { "output": " For example, if the sales today are 300, and sales of yesterday are 250, then the lag of one day for sales is 250." }, { "output": " Lagging variables are important in time series because knowing what happened in different time periods in the past can greatly facilitate predictions for the future." 
}, { "output": " Ensembling is also disabled if a time column is selected or if time column is set to [Auto] on the experiment setup screen." }, { "output": " .. figure:: ../images/time_series_lag.png\n\t :alt: Lag\n\n``time_series_leaderboard_mode``\n\n.. dropdown:: Control the automatic time-series leaderboard mode\n\t:open:\n\n\tSelect from the following options:\n\n - 'diverse': explore a diverse set of models built using various expert settings." }, { "output": " - 'sliding_window': If the forecast horizon is N periods, create a separate model for \"each of the (gap, horizon) pairs of (0,n), (n,n), (2*n,n), ..., (2*N-1, n) in units of time periods." }, { "output": " This can help to improve short-term forecasting quality. ``time_series_leaderboard_periods_per_model``\n~\n.. dropdown:: Number of periods per model if time_series_leaderboard_mode is 'sliding_window'\n\t:open:\n\n\tSpecify the number of periods per model if ``time_series_leaderboard_mode`` is set to ``sliding_window``." }, { "output": " .. _time_series_merge_splits:\n\n``time_series_merge_splits``\n\n.. dropdown:: Larger Validation Splits for Lag-Based Recipe\n\t:open:\n\n\tSpecify whether to create larger validation splits that are not bound to the length of the forecast horizon." }, { "output": " This is enabled by default. ``merge_splits_max_valid_ratio``\n\n.. dropdown:: Maximum Ratio of Training Data Samples Used for Validation\n\t:open:\n\n\tSpecify the maximum ratio of training data samples used for validation across splits when larger validation splits are created (see :ref:`time_series_merge_splits` setting)." }, { "output": " .. _fixed_size_splits:\n\n``fixed_size_splits``\n~\n.. dropdown:: Fixed-Size Train Timespan Across Splits\n\t:open:\n\n\tSpecify whether to keep a fixed-size train timespan across time-based splits during internal validation." }, { "output": " This is disabled by default. ``time_series_validation_fold_split_datetime_boundaries``\n~\n.. dropdown:: Custom Validation Splits for Time-Series Experiments\n\t:open:\n\n\tSpecify date or datetime timestamps (in the same format as the time column) to use for custom training and validation splits." }, { "output": " This value defaults to 30. .. _holiday-calendar:\n\n``holiday_features``\n\n.. dropdown:: Generate Holiday Features\n\t:open:\n\n\tFor time-series experiments, specify whether to generate holiday features for the experiment." }, { "output": " ``holiday_countries``\n~\n.. dropdown:: Country code(s) for holiday features\n\t:open:\n\n\tSpecify country codes in the form of a list that is used to look up holidays." }, { "output": " ``override_lag_sizes``\n\n.. dropdown:: Time-Series Lags Override\n\t:open:\n\n\tSpecify the override lags to be used." }, { "output": " The following examples show the variety of different methods that can be used to specify override lags:\n\n\t- \"[0]\" disable lags\n\t- \"[7, 14, 21]\" specifies this exact list\n\t- \"21\" specifies every value from 1 to 21\n\t- \"21:3\" specifies every value from 1 to 21 in steps of 3\n\t- \"5-21\" specifies every value from 5 to 21\n\t- \"5-21:3\" specifies every value from 5 to 21 in steps of 3\n\n``override_ufapt_lag_sizes``\n\n.. dropdown:: Lags Override for Features That are not Known Ahead of Time\n\t:open:\n\n\tSpecify lags override for non-target features that are not known ahead of time." 
}, { "output": " - \"[0]\" disable lags\n\t- \"[7, 14, 21]\" specifies this exact list\n\t- \"21\" specifies every value from 1 to 21\n\t- \"21:3\" specifies every value from 1 to 21 in steps of 3\n\t- \"5-21\" specifies every value from 5 to 21\n\t- \"5-21:3\" specifies every value from 5 to 21 in steps of 3\n\n``min_lag_size``\n\n.. dropdown:: Smallest Considered Lag Size\n\t:open:\n\n\tSpecify a minimum considered lag size." }, { "output": " ``allow_time_column_as_feature``\n\n.. dropdown:: Enable Feature Engineering from Time Column\n\t:open:\n\n\tSpecify whether to enable feature engineering based on the selected time column, e.g." }, { "output": " This is enabled by default. ``allow_time_column_as_numeric_feature``\n\n.. dropdown:: Allow Integer Time Column as Numeric Feature\n\t:open:\n\n\tSpecify whether to enable feature engineering from an integer time column." }, { "output": " This is disabled by default. ``datetime_funcs``\n\n.. dropdown:: Allowed Date and Date-Time Transformations\n\t:open:\n\n\tSpecify the date or date-time transformations to allow Driverless AI to use." }, { "output": " Note that ``get_num`` can lead to overfitting if used on IID problems and is disabled by default. .. _filter_datetime_funcs:\n\n``filter_datetime_funcs``\n~\n.. dropdown:: Auto Filtering of Date and Date-Time Transformations\n\t:open:\n\n\tWhether to automatically filter out date and date-time transformations that would lead to unseen values in the future." }, { "output": " ``allow_tgc_as_features``\n~\n.. dropdown:: Consider Time Groups Columns as Standalone Features\n\t:open:\n\n\tSpecify whether to consider time groups columns as standalone features." }, { "output": " ``allowed_coltypes_for_tgc_as_features``\n\n.. dropdown:: Which TGC Feature Types to Consider as Standalone Features\n\t:open:\n\n\tSpecify whether to consider time groups columns (TGC) as standalone features." }, { "output": " Available types are numeric, categorical, ohe_categorical, datetime, date, and text. All types are selected by default." }, { "output": " Also note that if \"Time Series Lag-Based Recipe\" is disabled, then all time group columns are allowed features." }, { "output": " This is set to Auto by default. ``tgc_only_use_all_groups``\n~\n.. dropdown:: Always Group by All Time Groups Columns for Creating Lag Features\n\t:open:\n\n\tSpecify whether to group by all time groups columns for creating lag features, instead of sampling from them." }, { "output": " ``tgc_allow_target_encoding``\n~\n.. dropdown:: Allow Target Encoding of Time Groups Columns\n\t:open:\n\n\tSpecify whether it is allowed to target encode the time groups columns." }, { "output": " Notes:\n\n\t- This setting is not affected by ``allow_tgc_as_features``. - Subgroups can be encoded by disabling ``tgc_only_use_all_groups``." }, { "output": " This is enabled by default. This can be useful for MLI, but it will slow down the experiment considerably when enabled." }, { "output": " ``time_series_validation_splits``\n~\n.. dropdown:: Number of Time-Based Splits for Internal Model Validation\n\t:open:\n\n\tSpecify a fixed number of time-based splits for internal model validation." }, { "output": " This value defaults to -1 (auto). ``time_series_splits_max_overlap``\n\n.. dropdown:: Maximum Overlap Between Two Time-Based Splits\n\t:open:\n\n\tSpecify the maximum overlap between two time-based splits." }, { "output": " This value defaults to 0.5. ``time_series_max_holdout_splits``\n\n.. 
dropdown:: Maximum Number of Splits Used for Creating Final Time-Series Model's Holdout Predictions\n\t:open:\n\n\tSpecify the maximum number of splits used for creating the final time-series Model's holdout predictions." }, { "output": " Use \t``time_series_validation_splits`` to control amount of time-based splits used for model validation." }, { "output": " This setting is used for MLI and calculating metrics. Note that predictions can be slightly less accurate when this setting is enabled." }, { "output": " ``mli_ts_fast_approx_contribs``\n~\n.. dropdown:: Whether to Speed up Calculation of Shapley Values for Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to speed up Shapley values for time-series holdout predictions for back-testing on training data." }, { "output": " Note that predictions can be slightly less accurate when this setting is enabled. This is enabled by default." }, { "output": " This can be useful for MLI, but it can slow down the experiment when enabled. If this setting is disabled, MLI will generate Shapley values on demand." }, { "output": " ``time_series_min_interpretability``\n\n.. dropdown:: Lower Limit on Interpretability Setting for Time-Series Experiments (Implicitly Enforced)\n\t:open:\n\n\tSpecify the lower limit on interpretability setting for time-series experiments." }, { "output": " To disable this setting, set this value to 1. ``lags_dropout``\n\n.. dropdown:: Dropout Mode for Lag Features\n\t:open:\n\n\tSpecify the dropout mode for lag features in order to achieve an equal n.a." }, { "output": " Independent mode performs a simple feature-wise dropout. Dependent mode takes the lag-size dependencies per sample/row into account." }, { "output": " ``prob_lag_non_targets``\n\n.. dropdown:: Probability to Create Non-Target Lag Features\n\t:open:\n\n\tLags can be created on any feature as well as on the target." }, { "output": " This value defaults to 0.1. .. _rolling-test-set-method:\n\n``rolling_test_method``\n~\n.. dropdown:: Method to Create Rolling Test Set Predictions\n\t:open:\n\n\tSpecify the method used to create rolling test set predictions." }, { "output": " TTA is enabled by default. Notes: \n\t\n\t- This setting only applies to the test set that is provided by the user during an experiment." }, { "output": " ``fast_tta_internal``\n~\n.. dropdown:: Fast TTA for Internal Validation\n\t:open:\n\n\tSpecify whether the genetic algorithm applies Test Time Augmentation (TTA) in one pass instead of using rolling windows for validation splits longer than the forecast horizon." }, { "output": " ``prob_default_lags``\n~\n.. dropdown:: Probability for New Time-Series Transformers to Use Default Lags\n\t:open:\n\n\tSpecify the probability for new lags or the EWMA gene to use default lags." }, { "output": " This value defaults to 0.2. ``prob_lagsinteraction``\n\n.. dropdown:: Probability of Exploring Interaction-Based Lag Transformers\n\t:open:\n\n\tSpecify the unnormalized probability of choosing other lag time-series transformers based on interactions." }, { "output": " ``prob_lagsaggregates``\n~\n.. dropdown:: Probability of Exploring Aggregation-Based Lag Transformers\n\t:open:\n\n\tSpecify the unnormalized probability of choosing other lag time-series transformers based on aggregations." }, { "output": " .. _centering-detrending:\n\n``ts_target_trafo``\n~\n.. dropdown:: Time Series Centering or Detrending Transformation\n\t:open:\n\n\tSpecify whether to use centering or detrending transformation for time series experiments." 
}, { "output": " Linear or Logistic will remove the fitted linear or logistic trend, Centering will only remove the mean of the target signal and Epidemic will remove the signal specified by a `Susceptible-Infected-Exposed-Recovered-Dead `_ (SEIRD) epidemic model." }, { "output": " Notes:\n\n\t- MOJO support is currently disabled when this setting is enabled. - The Fast centering and linear detrending options use least squares fitting." }, { "output": " outliers. - Please see (:ref:`Custom Bounds for SEIRD Epidemic Model Parameters `) for further details on how to customize the bounds of the free SEIRD parameters." }, { "output": " The target column must correspond to *I(t)*, which represents infection cases as a function of time." }, { "output": " The model's value is then subtracted from the training response, and the residuals are passed to the feature engineering and modeling pipeline." }, { "output": " The following is a list of free parameters:\n\n\t- N: Total population, *N = S+E+I+R+D*\n\t- beta: Rate of exposure (*S* -> *E*)\n\t- gamma: Rate of recovering (*I* -> *R*)\n\t- delta: Incubation period\n\t- alpha: Fatality rate\n\t- rho: Rate at which individuals expire\n\t- lockdown: Day of lockdown (-1 => no lockdown)\n\t- beta_decay: Beta decay due to lockdown\n\t- beta_decay_rate: Speed of beta decay\n\n\tProvide upper or lower bounds for each parameter you want to control." }, { "output": " For example:\n\n\t::\n\n\t ts_target_trafo_epidemic_params_dict=\"{'N_min': 1000, 'beta_max': 0.2}\"\n\n\tRefer to https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology and https://arxiv.org/abs/1411.3435 for more information on the SEIRD model." }, { "output": " To get the SEIR model, set ``alpha_min=alpha_max=rho_min=rho_max=beta_decay_rate_min=beta_decay_rate_max=0`` and ``lockdown_min=lockdown_max=-1``." }, { "output": " Select from the following:\n\n\t- I (Default): Infected\n\t- R: Recovered\n\t- D: Deceased\n\n.. _ts-target-transformation:\n\n``ts_lag_target_trafo``\n~\n.. dropdown:: Time Series Lag-Based Target Transformation\n\t:open:\n\n\tSpecify whether to use either the difference between or ratio of the current target and a lagged target." }, { "output": " Notes:\n\n\t- MOJO support is currently disabled when this setting is enabled. - The corresponding lag size is specified with the ``ts_target_trafo_lag_size`` expert setting." }, { "output": " .. _install-on-aws:\n\nInstall on AWS\n\n\nDriverless AI can be installed on Amazon AWS using the AWS Marketplace AMI or the AWS Community AMI." }, { "output": " Google Cloud Storage Setup\n\n\nDriverless AI lets you explore Google Cloud Storage data sources from within the Driverless AI application." }, { "output": " This setup requires you to enable authentication. If you enable GCS or GBP connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication." }, { "output": " Obtain a JSON authentication file from `GCP `__." }, { "output": " Mount the JSON file to the Docker instance. 3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json config option." }, { "output": " You can be provided a JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or none at all." }, { "output": " Use ``docker version`` to check which version of Docker you are using. 
Description of Configuration Attributes\n'\n\n- ``gcs_path_to_service_account_json``: Specifies the path to the /json_auth_file.json file." }, { "output": " Start GCS with Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the GCS data connector with authentication by passing the JSON authentication file." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,gcs\" \\\n -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON=\"/service_account_json.json\" \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n -v `pwd`/service_account_json.json:/service_account_json.json \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure the GCS data connector options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, gcs\"``\n - ``gcs_path_to_service_account_json = \"/service_account_json.json\"`` \n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the GCS data connector with authentication by passing the JSON authentication file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." 
}, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, gcs\"\n\n # GCS Connector credentials\n # example (suggested) \"/licenses/my_service_account_json.json\"\n gcs_path_to_service_account_json = \"/service_account_json.json\"\n\n 3." }, { "output": " .. _model-settings:\n\nModel Settings\n\n\n``enable_constant_model``\n~\n.. dropdown:: Constant Models\n\t:open:\n\n\tSpecify whether to enable :ref:`constant models `." }, { "output": " ``enable_decision_tree``\n\n.. dropdown:: Decision Tree Models\n\t:open:\n\n\tSpecify whether to build Decision Tree models as part of the experiment." }, { "output": " In this case, Driverless AI will build Decision Tree models if interpretability is greater than or equal to the value of ``decision_tree_interpretability_switch`` (which defaults to 7) and accuracy is less than or equal to ``decision_tree_accuracy_switch`` (which defaults to 7)." }, { "output": " GLMs are very interpretable models with one coefficient per feature, an intercept term and a link function." }, { "output": " ``enable_xgboost_gbm``\n\n.. dropdown:: XGBoost GBM Models\n\t:open:\n\n\tSpecify whether to build XGBoost models as part of the experiment (for both the feature engineering part and the final model)." }, { "output": " This is set to Auto by default. In this case, Driverless AI will use XGBoost unless the number of rows * columns is greater than a threshold." }, { "output": " ``enable_lightgbm``\n~\n.. dropdown:: LightGBM Models\n\t:open:\n\n\tSpecify whether to build LightGBM models as part of the experiment." }, { "output": " This is set to Auto (enabled) by default. ``enable_xgboost_dart``\n~\n.. dropdown:: XGBoost Dart Models\n\t:open:\n\n\tSpecify whether to use XGBoost's Dart method when building models for experiment (for both the feature engineering part and the final model)." }, { "output": " .. _enable_xgboost_rapids:\n\n``enable_xgboost_rapids``\n~\n.. dropdown:: Enable RAPIDS-cuDF extensions to XGBoost GBM/Dart\n\t:open:\n\n\tSpecify whether to enable RAPIDS extensions to XGBoost GBM/Dart." }, { "output": " The equivalent config.toml parameter is ``enable_xgboost_rapids`` and the default value is False. Disabled for dask multinode models due to bug in dask_cudf and xgboost." }, { "output": " This setting is disabled unless switched on. .. _enable_xgboost_gbm_dask:\n\n``enable_xgboost_gbm_dask``\n~\n.. dropdown:: Enable Dask_cuDF (multi-GPU) XGBoost GBM\n\t:open:\n\n\tSpecify whether to enable Dask_cudf (multi-GPU) version of XGBoost GBM." }, { "output": " Only applicable for single final model without early stopping. No Shapley possible. The equivalent config.toml parameter is ``enable_xgboost_gbm_dask`` and the default value is \"auto\"." }, { "output": " This option is disabled unless switched on. Only applicable for single final model without early stopping." }, { "output": " The equivalent config.toml parameter is ``enable_xgboost_dart_dask`` and the default value is \"auto\"." }, { "output": " .. _enable_lightgbm_dask:\n\n``enable_lightgbm_dask``\n\n.. dropdown:: Enable Dask (multi-node) LightGBM\n\t:open:\n\n\tSpecify whether to enable multi-node LightGBM." }, { "output": " The equivalent config.toml parameter is ``enable_lightgbm_dask`` and default value is \"auto\". To enable multinode Dask see :ref:`Dask Multinode Training `." }, { "output": " \"auto\" and \"on\" are same currently. 
Dask mode for hyperparameter search is enabled if:\n\n\t\t1) You have a :ref:`Dask multinode cluster ` or a multi-GPU node, and each model uses 1 GPU (see :ref:`num-gpus-per-model`)." }, { "output": " The equivalent config.toml parameter is ``enable_hyperopt_dask`` and the default value is \"auto\". .. _num_inner_hyperopt_trials_prefinal:\n\n``num_inner_hyperopt_trials_prefinal``\n\n.. dropdown:: Number of trials for hyperparameter optimization during model tuning only\n\t:open:\n\n\tSpecify the number of trials for Optuna hyperparameter optimization for tuning and evolution of models." }, { "output": " 0 means no trials. For small data, 100 is fine, while for larger data smaller values are reasonable if you need results quickly." }, { "output": " The equivalent config.toml parameter is ``num_inner_hyperopt_trials_prefinal`` and the default value is 0." }, { "output": " However, this can overfit on a single fold when doing tuning or evolution, and if using cross validation, averaging the fold hyperparameters can lead to unexpected results." }, { "output": " If using RAPIDS or DASK, this is the number of trials for rapids-cudf hyperparameter optimization within XGBoost GBM/Dart and LightGBM, and hyperparameter optimization keeps data on the GPU the entire time." }, { "output": " This setting applies to the final model only, even if num_inner_hyperopt_trials=0. The equivalent config.toml parameter is ``num_inner_hyperopt_trials_final`` and the default value is 0." }, { "output": " The default value is -1, which means all. 0 is the same as choosing no Optuna trials. It might only be beneficial to optimize hyperparameters of the best individual (i.e." }, { "output": " The default value is -1, which means all. The equivalent config.toml parameter is ``num_hyperopt_individuals_final``\n\n``optuna_pruner``\n~\n.. dropdown:: Optuna Pruners\n\t:open:\n\n\t`Optuna Pruner `__ algorithm to use for early stopping of unpromising trials (applicable to XGBoost and LightGBM that support Optuna callbacks)." }, { "output": " To disable, choose None. The equivalent config.toml parameter is ``optuna_pruner``\n\n``optuna_sampler``\n\n.. dropdown:: Optuna Samplers\n\t:open:\n\n\t`Optuna Sampler `__ algorithm to use for narrowing down and optimizing the search space (applicable to XGBoost and LightGBM that support Optuna callbacks)." }, { "output": " To disable, choose None. The equivalent config.toml parameter is ``optuna_sampler``\n\n``enable_xgboost_hyperopt_callback``\n\n\n.. dropdown:: Enable Optuna XGBoost Pruning callback\n\t:open:\n\n\tSpecify whether to enable Optuna's XGBoost Pruning callback to abort unpromising runs." }, { "output": " This is not enabled when tuning the learning rate. The equivalent config.toml parameter is ``enable_xgboost_hyperopt_callback``\n\n``enable_lightgbm_hyperopt_callback``\n~\n.. dropdown:: Enable Optuna LightGBM Pruning callback\n\t:open:\n\n\tSpecify whether to enable Optuna's LightGBM Pruning callback to abort unpromising runs." }, { "output": " This is not enabled when tuning the learning rate. The equivalent config.toml parameter is ``enable_lightgbm_hyperopt_callback``\n\n``enable_tensorflow``\n~\n.. dropdown:: TensorFlow Models\n\t:open:\n\n\tSpecify whether to build `TensorFlow `__ models as part of the experiment (usually only for text feature engineering and for the final model unless it's used exclusively)." }, { "output": " This is set to Auto by default (not used unless the number of classes is greater than 10).
TensorFlow models are not yet supported by Java MOJOs (only Python scoring pipelines and C++ MOJOs are supported)." }, { "output": " By default, this parameter is set to auto, i.e., Driverless AI decides internally whether to use the algorithm for the experiment. ``enable_ftrl``\n~\n.. dropdown:: FTRL Models\n\t:open:\n\n\tSpecify whether to build Follow the Regularized Leader (FTRL) models as part of the experiment." }, { "output": " FTRL supports binomial and multinomial classification for categorical targets, as well as regression for continuous targets." }, { "output": " ``enable_rulefit``\n\n.. dropdown:: RuleFit Models\n\t:open:\n\n\tSpecify whether to build `RuleFit `__ models as part of the experiment." }, { "output": " Note that multiclass classification is not yet supported for RuleFit models. Rules are stored to text files in the experiment directory for now." }, { "output": " .. _zero-inflated:\n\n``enable_zero_inflated_models``\n~\n.. dropdown:: Zero-Inflated Models\n\t:open:\n\n\tSpecify whether to enable the automatic addition of :ref:`zero-inflated models ` for regression problems with zero-inflated target values that meet certain conditions:\n\n\t::\n\n\t y >= 0, y.std() > y.mean()\n\n\tThis is set to Auto by default." }, { "output": " Select one or more of the following:\n\n\t- gbdt: Boosted trees\n\t- rf_early_stopping: Random Forest with early stopping\n\t- rf: Random Forest\n\t- dart: Dropout boosted trees with no early stopping\n\n\tgbdt and rf are both enabled by default." }, { "output": " This is disabled by default. Notes:\n\n\t- Only supported for CPU. - A MOJO is not built when this is enabled." }, { "output": " LightGBM CUDA is supported on Linux x86-64 environments. ``show_constant_model``\n~\n.. dropdown:: Whether to Show Constant Models in Iteration Panel\n\t:open:\n\n\tSpecify whether to show constant models in the iteration panel." }, { "output": " ``params_tensorflow``\n~\n.. dropdown:: Parameters for TensorFlow\n\t:open:\n\n\tSpecify specific parameters for TensorFlow to override Driverless AI parameters." }, { "output": " Different strategies for using TensorFlow parameters can be viewed `here `__." }, { "output": " This defaults to 3000. Depending on accuracy settings, a fraction of this limit will be used. ``n_estimators_list_no_early_stopping``\n~\n.. dropdown:: n_estimators List to Sample From for Model Mutations for Models That Do Not Use Early Stopping\n\t:open:\n\n\tFor LightGBM, the dart and normal random forest modes do not use early stopping." }, { "output": " ``min_learning_rate_final``\n~\n.. dropdown:: Minimum Learning Rate for Final Ensemble GBM Models\n\t:open:\n\n\tThis value defaults to 0.01." }, { "output": " Then, one can try increasing the learning rate by raising this minimum, or one can try increasing the maximum number of trees/iterations." }, { "output": " This value defaults to 0.05. ``max_nestimators_feature_evolution_factor``\n\n.. dropdown:: Reduction Factor for Max Number of Trees/Iterations During Feature Evolution\n\t:open:\n\n\tSpecify the factor by which the value specified by the :ref:`max-trees-iterations` setting is reduced for tuning and feature evolution." }, { "output": " So by default, Driverless AI will produce no more than 0.2 * 3000 trees/iterations during feature evolution." }, { "output": " .. _max_abs_score_delta_train_valid:\n\n``max_abs_score_delta_train_valid``\n~\n.. dropdown:: Max.
}, { "output": " absolute delta between training and validation scores for tree models\n\t:open:\n\n\tModify early stopping behavior for tree-based models (LightGBM, XGBoostGBM, CatBoost) such that training score (on training data, not holdout) and validation score differ no more than this absolute value (i.e., stop adding trees once abs(train_score - valid_score) > max_abs_score_delta_train_valid)." }, { "output": " This option is Experimental, and only for expert use to keep model complexity low. To disable, set to 0.0." }, { "output": " .. _max_rel_score_delta_train_valid:\n\n``max_rel_score_delta_train_valid``\n~\n.. dropdown:: Max. relative delta between training and validation scores for tree models\n\t:open:\n\n\tModify early stopping behavior for tree-based models (LightGBM, XGBoostGBM, CatBoost) such that training score (on training data, not holdout) and validation score differ no more than this relative value (i.e., stop adding trees once abs(train_score - valid_score) > max_rel_score_delta_train_valid * abs(train_score))." }, { "output": " This option is Experimental, and only for expert use to keep model complexity low. To disable, set to 0.0." }, { "output": " ``min_learning_rate``\n~\n.. dropdown:: Minimum Learning Rate for Feature Engineering GBM Models\n\t:open:\n\n\tSpecify the minimum learning rate for feature engineering GBM models." }, { "output": " ``max_learning_rate``\n~\n.. dropdown:: Max Learning Rate for Tree Models\n\t:open:\n\n\tSpecify the maximum learning rate for tree models during feature engineering." }, { "output": " This value defaults to 0.5. ``max_epochs``\n\n.. dropdown:: Max Number of Epochs for TensorFlow/FTRL\n\t:open:\n\n\tWhen building TensorFlow or FTRL models, specify the maximum number of epochs to train models with (it might stop earlier)." }, { "output": " This option is ignored if TensorFlow models and/or FTRL models is disabled. ``max_max_depth``\n~\n.. dropdown:: Max Tree Depth\n\t:open:\n\n\tSpecify the maximum tree depth." }, { "output": " This value defaults to 12. ``max_max_bin``\n~\n.. dropdown:: Max max_bin for Tree Features\n\t:open:\n\n\tSpecify the maximum ``max_bin`` for tree features." }, { "output": " ``rulefit_max_num_rules``\n~\n.. dropdown:: Max Number of Rules for RuleFit\n\t:open:\n\n\tSpecify the maximum number of rules to be used for RuleFit models." }, { "output": " .. _ensemble_meta_learner:\n\n``ensemble_meta_learner``\n~\n.. dropdown:: Ensemble Level for Final Modeling Pipeline\n\t:open:\n\n\tModel to combine base model predictions, for experiments that create a final pipeline\n\tconsisting of multiple base models:\n\n\t- blender: Creates a linear blend with non-negative weights that add to 1 (blending) - recommended\n\t- extra_trees: Creates a tree model to non-linearly combine the base models (stacking) - experimental, and recommended to also set enable :ref:`cross_validate_meta_learner`." }, { "output": " (Default)\n\t- 0 = No ensemble, only final single model on validated iteration/tree count. Note that holdout predicted probabilities will not be available." }, { "output": " - 1 = 1 model, multiple ensemble folds (cross-validation)\n\t- 2 = 2 models, multiple ensemble folds (cross-validation)\n\t- 3 = 3 models, multiple ensemble folds (cross-validation)\n\t- 4 = 4 models, multiple ensemble folds (cross-validation)\n\n\tThe equivalent config.toml parameter is ``fixed_ensemble_level``." }, { "output": " Especially recommended for ensemble_meta_learner='extra_trees', to make unbiased training holdout predictions." 
}, { "output": " Not needed for ensemble_meta_learner='blender'. ``cross_validate_single_final_model``\n~\n.. dropdown:: Cross-Validate Single Final Model\n\t:open:\n\n\tDriverless AI normally produces a single final model for low accuracy settings (typically, less than 5)." }, { "output": " The final pipeline will build :math:`N+1` models, with N-fold cross validation for the single final model." }, { "output": " Note that the setting for this option is ignored for time-series experiments or when a validation dataset is provided." }, { "output": " Specify a lower value to avoid excessive tuning, or specify a higher to perform enhanced tuning. This option defaults to -1 (auto)." }, { "output": " This is set to off by default. Choose from the following options:\n\n\t- auto: sample both classes as needed, depending on data\n\t- over_under_sampling: over-sample the minority class and under-sample the majority class, depending on data\n\t- under_sampling: under-sample the majority class to reach class balance\n\t- off: do not perform any sampling\n\n\tThis option is closely tied with the Imbalanced Light GBM and Imbalanced XGBoost GBM models, which can be enabled/disabled on the Recipes tab under :ref:`included_models`." }, { "output": " If the target fraction proves to be above the allowed imbalance threshold, then sampling will be triggered." }, { "output": " The setting here will be ignored. ``imbalance_sampling_threshold_min_rows_original``\n\n.. dropdown:: Threshold for Minimum Number of Rows in Original Training Data to Allow Imbalanced Sampling\n\t:open:\n\n\tSpecify a threshold for the minimum number of rows in the original training data that allow imbalanced sampling." }, { "output": " ``imbalance_ratio_sampling_threshold``\n\n.. dropdown:: Ratio of Majority to Minority Class for Imbalanced Binary Classification to Trigger Special Sampling Techniques (if Enabled)\n\t:open:\n\n\tFor imbalanced binary classification problems, specify the ratio of majority to minority class." }, { "output": " This value defaults to 5. ``heavy_imbalance_ratio_sampling_threshold``\n\n.. dropdown:: Ratio of Majority to Minority Class for Heavily Imbalanced Binary Classification to Only Enable Special Sampling Techniques (if Enabled)\n\t:open:\n\n\tFor heavily imbalanced binary classification, specify the ratio of the majority to minority class equal and above which to enable only special imbalanced models on the full original data without upfront sampling." }, { "output": " ``imbalance_sampling_number_of_bags``\n~\n.. dropdown:: Number of Bags for Sampling Methods for Imbalanced Binary Classification (if Enabled)\n\t:open:\n\n\tSpecify the number of bags for sampling methods for imbalanced binary classification." }, { "output": " ``imbalance_sampling_max_number_of_bags``\n~\n.. dropdown:: Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification\n\t:open:\n\n\tSpecify the limit on the number of bags for sampling methods for imbalanced binary classification." }, { "output": " ``imbalance_sampling_max_number_of_bags_feature_evolution``\n~\n.. dropdown:: Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification During Feature Evolution Phase\n\t:open:\n\n\tSpecify the limit on the number of bags for sampling methods for imbalanced binary classification." }, { "output": " Note that this setting only applies to shift, leakage, tuning, and feature evolution models. 
To limit final models, use the Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification setting." }, { "output": " This setting controls the approximate number of bags and is only active when the \"Hard limit on number of bags for sampling methods for imbalanced binary classification during feature evolution phase\" option is set to -1." }, { "output": " ``imbalance_sampling_target_minority_fraction``\n~\n.. dropdown:: Target Fraction of Minority Class After Applying Under/Over-Sampling Techniques\n\t:open:\n\n\tSpecify the target fraction of a minority class after applying under/over-sampling techniques." }, { "output": " When starting from an extremely imbalanced original target, it can be advantageous to specify a smaller value such as 0.1 or 0.01." }, { "output": " ``ftrl_max_interaction_terms_per_degree``\n~\n.. dropdown:: Max Number of Automatic FTRL Interactions Terms for 2nd, 3rd, 4th order interactions terms (Each)\n\t:open:\n\n\tSamples the number of automatic FTRL interactions terms to no more than this value (for each of 2nd, 3rd, 4th order terms)." }, { "output": " When enabled, this setting provides error bars to validation and test scores based on the standard error of the bootstrap mean." }, { "output": " ``tensorflow_num_classes_switch``\n~\n.. dropdown:: For Classification Problems with This Many Classes, Default to TensorFlow\n\t:open:\n\n\tSpecify the number of classes above which to use TensorFlow when it is enabled." }, { "output": " (Models set to On, however, are still used.) This value defaults to 10. .. _compute-intervals:\n\n``prediction_intervals``\n\n.. dropdown:: Compute Prediction Intervals\n\t:open:\n\n\tSpecify whether to compute empirical prediction intervals based on holdout predictions." }, { "output": " .. _confidence-level:\n\n``prediction_intervals_alpha``\n\n.. dropdown:: Confidence Level for Prediction Intervals\n\t:open:\n\n\tSpecify a confidence level for prediction intervals." }, { "output": " ``dump_modelparams_every_scored_indiv``\n~\n\n.. dropdown:: Enable detailed scored model info\n\t:open:\n\n\tWhether to dump every scored individual's model parameters to csv/tabulated/json files." }, { "output": " Install the Driverless AI AWS Community AMI\n-\n\nWatch the installation video `here `__." }, { "output": " Environment\n~\n\n+----------+---------------+----------+-----------------+\n| Provider | Instance Type | Num GPUs | Suitable for    |\n+==========+===============+==========+=================+\n| AWS      | p2.xlarge     | 1        | Experimentation |\n+----------+---------------+----------+-----------------+\n|          | p2.8xlarge    | 8        | Serious use     |\n+----------+---------------+----------+-----------------+\n|          | p2.16xlarge   | 16       | Serious use     |\n+----------+---------------+----------+-----------------+\n|          | p3.2xlarge    | 1        | Experimentation |\n+----------+---------------+----------+-----------------+\n|          | p3.8xlarge    | 4        | Serious use     |\n+----------+---------------+----------+-----------------+\n|          | p3.16xlarge   | 8        | Serious use     |\n+----------+---------------+----------+-----------------+\n|          | g3.4xlarge    | 1        | Experimentation |\n+----------+---------------+----------+-----------------+\n|          | g3.8xlarge    | 2        | Experimentation |\n+----------+---------------+----------+-----------------+\n|          | g3.16xlarge   | 4        | Serious use     |\n+----------+---------------+----------+-----------------+\n\n\nInstalling the EC2 Instance\n~\n\n1." }, { "output": " 2. In the upper right corner of the Amazon Web Services page, set the location drop-down. (Note: We recommend selecting the US East region because H2O's resources are stored there.)" }, { "output": " .. image:: ../images/ami_location_dropdown.png\n :align: center\n\n\n3. Select the EC2 option under the Compute section to open the EC2 Dashboard." }, { "output": " Click the Launch Instance button under the Create Instance section. .. image:: ../images/ami_launch_instance_button.png\n :align: center\n\n5." }, { "output": " .. image:: ../images/ami_select_h2oai_ami.png\n :align: center\n\n6.
On the Choose an Instance Type page, select GPU compute in the Filter by dropdown." }, { "output": " Select a GPU compute instance from the available options. (We recommend at least 32 vCPUs.) Click the Next: Configure Instance Details button." }, { "output": " Specify the Instance Details that you want to configure. Create a VPC or use an existing one, and ensure that \"Auto-Assign Public IP\" is enabled and associated with your subnet." }, { "output": " .. image:: ../images/ami_configure_instance_details.png\n :align: center\n\n8. Specify the Storage Device settings." }, { "output": " The machine should have a minimum of 30 GB of disk space. Click Next: Add Tags. .. image:: ../images/ami_add_storage.png\n :align: center\n\n9." }, { "output": " Click Next: Configure Security Group. 10. Add the following security rules to enable SSH access to Driverless AI, then click Review and Launch." }, { "output": " 12. A popup will appear prompting you to select a key pair. This is required in order to SSH into the instance." }, { "output": " Be sure to accept the acknowledgement, then click Launch Instances to start the new instance. .. image:: ../images/ami_select_key_pair.png\n :align: center\n\n13." }, { "output": " Click the View Instances button to see information about the instance including the IP address. The Connect button on this page provides information on how to SSH into your instance." }, { "output": " Open a Terminal window and SSH into the IP address of the AWS instance. Replace the DNS name below with your instance DNS." }, { "output": " .. code-block:: bash\n\n chmod 400 mykeypair.pem\n\n15. If you selected a GPU-compute instance, then you must enable persistence and optimizations of the GPU." }, { "output": " Note also that these commands need to be run once every reboot. Refer to the following for more information: \n\n - http://docs.nvidia.com/deploy/driver-persistence/index.html\n - https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/optimize_gpu.html\n - https://www.migenius.com/articles/realityserver-on-aws\n\n .. code-block:: bash\n\n # g3:\n sudo nvidia-persistenced --persistence-mode\n sudo nvidia-smi -acp 0\n sudo nvidia-smi --auto-boost-permission=0\n sudo nvidia-smi --auto-boost-default=0\n sudo nvidia-smi -ac \"2505,1177\"\n\n # p2:\n sudo nvidia-persistenced --persistence-mode\n sudo nvidia-smi -acp 0\n sudo nvidia-smi --auto-boost-permission=0\n sudo nvidia-smi --auto-boost-default=0\n sudo nvidia-smi -ac \"2505,875\"\n\n # p3:\n sudo nvidia-persistenced --persistence-mode\n sudo nvidia-smi -acp 0\n sudo nvidia-smi -ac \"877,1530\"\n\n\n16." }, { "output": " For example:\n\n .. code-block:: bash\n\n scp -i /path/mykeypair.pem ubuntu@ec2-34-230-6-230.compute-1.amazonaws.com:/path/to/file/to/be/copied/example.csv /path/of/destination/on/local/machine\n\n where:\n \n * ``-i`` is the identity file option\n * ``mykeypair`` is the name of the private keypair file\n * ``ubuntu`` is the user name on the instance\n * ``ec2-34-230-6-230.compute-1.amazonaws.com`` is the public DNS name of the instance\n * ``example.csv`` is the file to transfer\n\n17." }, { "output": " Sign in to Driverless AI with the username h2oai and use the AWS InstanceID as the password. You will be prompted to enter your Driverless AI license key when you log in for the first time." }, { "output": " To stop the instance: \n\n1. On the EC2 Dashboard, click the Running Instances link under the Resources section." }, { "output": " Select the instance that you want to stop. 3.
In the Actions drop down menu, select Instance State > Stop." }, { "output": " .. _nlp-settings:\n\nNLP Settings\n\n\n``enable_tensorflow_textcnn``\n~\n.. dropdown:: Enable Word-Based CNN TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Word-based CNN TensorFlow models as transformers for NLP." }, { "output": " We recommend that you disable this option on systems that do not use GPUs. ``enable_tensorflow_textbigru``\n~\n.. dropdown:: Enable Word-Based BiGRU TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Word-based BiG-RU TensorFlow models as transformers for NLP." }, { "output": " We recommend that you disable this option on systems that do not use GPUs. ``enable_tensorflow_charcnn``\n~\n.. dropdown:: Enable Character-Based CNN TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Character-level CNN TensorFlow models as transformers for NLP." }, { "output": " We recommend that you disable this option on systems that do not use GPUs. ``enable_pytorch_nlp_model``\n\n.. dropdown:: Enable PyTorch Models for NLP\n\t:open:\n\n\tSpecify whether to enable pretrained PyTorch models and fine-tune them for NLP tasks." }, { "output": " You need to set this to On if you want to use the PyTorch models like BERT for modeling. Only the first text column will be used for modeling with these models." }, { "output": " ``enable_pytorch_nlp_transformer``\n\n.. dropdown:: Enable pre-trained PyTorch Transformers for NLP\n\t:open:\n\n\tSpecify whether to enable pretrained PyTorch models for NLP tasks." }, { "output": " You need to set this to On if you want to use the PyTorch models like BERT for feature engineering (via fitting a linear model on top of pretrained embeddings)." }, { "output": " Notes:\n\n\t- This setting requires an Internet connection. ``pytorch_nlp_pretrained_models``\n~\n.. dropdown:: Select Which Pretrained PyTorch NLP Models to Use\n\t:open:\n\n\tSpecify one or more pretrained PyTorch NLP models to use." }, { "output": " - Models that are not selected by default may not have MOJO support. - Using BERT-like models may result in a longer experiment completion time." }, { "output": " The higher the number of epochs, the higher the run time. This value defaults to 2 and is ignored if TensorFlow models is disabled." }, { "output": " Values equal and above will add all enabled TensorFlow NLP models at the start of the experiment for text-dominated problems when the following NLP expert settings are set to Auto:\n\n\t- Enable word-based CNN TensorFlow models for NLP\n\t- Enable word-based BigRU TensorFlow models for NLP\n\t- Enable character-based CNN TensorFlow models for NLP\n\n\tIf the above transformations are set to ON, this parameter is ignored." }, { "output": " This value defaults to 5. ``pytorch_nlp_fine_tuning_num_epochs``\n\n.. dropdown:: Number of Epochs for Fine-Tuning of PyTorch NLP Models\n\t:open:\n\n\tSpecify the number of epochs used when fine-tuning PyTorch NLP models." }, { "output": " ``pytorch_nlp_fine_tuning_batch_size``\n\n.. dropdown:: Batch Size for PyTorch NLP Models\n\t:open:\n\n\tSpecify the batch size for PyTorch NLP models." }, { "output": " Note: Large models and batch sizes require more memory. ``pytorch_nlp_fine_tuning_padding_length``\n\n.. dropdown:: Maximum Sequence Length for PyTorch NLP Models\n\t:open:\n\n\tSpecify the maximum sequence length (padding length) for PyTorch NLP models." 
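}, { "output": " Taken together, these fine-tuning controls can also be set in the config.toml file. A sketch with purely illustrative values (the defaults differ by version and should be checked in your config.toml):\n\n ::\n\n pytorch_nlp_fine_tuning_num_epochs = 2\n pytorch_nlp_fine_tuning_batch_size = 10\n pytorch_nlp_fine_tuning_padding_length = 100"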
}, { "output": " Note: Large models and padding lengths require more memory. ``pytorch_nlp_pretrained_models_dir``\n~\n.. dropdown:: Path to Pretrained PyTorch NLP Models\n\t:open:\n\n\tSpecify a path to pretrained PyTorch NLP models." }, { "output": " Note that this can be either a path in the local file system (``/path/on/server/to/file.txt``) or an S3 location (``s3://``)." }, { "output": " - You can download the Glove embeddings from `here `__ and specify the local path in this box." }, { "output": " - You can also train your own custom embeddings. Please refer to `this code sample `__ for creating custom embeddings that can be passed on to this option." }, { "output": " .. _tensorflow_nlp_pretrained_s3_access_key_id:\n\n``tensorflow_nlp_pretrained_s3_access_key_id``\n\n.. dropdown:: S3 access key ID to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location\n\t:open:\n\n\tSpecify an S3 access key ID to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location." }, { "output": " .. _tensorflow_nlp_pretrained_s3_secret_access_key:\n\n``tensorflow_nlp_pretrained_s3_secret_access_key``\n\n.. dropdown:: S3 secret access key to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location\n\t:open:\n\n\tSpecify an S3 secret access key to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location." }, { "output": " ``tensorflow_nlp_pretrained_embeddings_trainable``\n\n.. dropdown:: For TensorFlow NLP, Allow Training of Unfrozen Pretrained Embeddings\n\t:open:\n\n\tSpecify whether to allow training of all weights of the neural network graph, including the pretrained embedding layer weights." }, { "output": " All other weights, however, will still be fine-tuned. This is disabled by default. ``text_fraction_for_text_dominated_problem``\n\n.. dropdown:: Fraction of Text Columns Out of All Features to be Considered a Text-Dominanted Problem\n\t:open:\n\n\tSpecify the fraction of text columns out of all features to be considered as a text-dominated problem." }, { "output": " Specify when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable." }, { "output": " This value defaults to 0.3. ``text_transformer_fraction_for_text_dominated_problem``\n\n.. dropdown:: Fraction of Text per All Transformers to Trigger That Text Dominated\n\t:open:\n\n\tSpecify the fraction of text columns out of all features to be considered a text-dominated problem." }, { "output": " ``string_col_as_text_threshold``\n\n.. dropdown:: Threshold for String Columns to be Treated as Text\n\t:open:\n\n\tSpecify the threshold value (from 0 to 1) for string columns to be treated as text (0.0 - text; 1.0 - string)." }, { "output": " ``text_transformers_max_vocabulary_size``\n~\n.. dropdown:: Max Size of the Vocabulary for Text Transformers\n\t:open:\n\n\tMax number of tokens created during fitting of Tfidf/Count based text transformers." }, { "output": " .. _quick-start-tables:\n\nQuick-Start Tables by Environment\n-\n\nUse the following tables for Cloud, Server, and Desktop to find the right setup instructions for your environment." 
}, { "output": " | Min Mem | Refer to Section |\n+=+=+=++\n| NVIDIA DGX-1 | Yes | 128 GB | :ref:`install-on-nvidia-dgx` |\n+-+-+-++\n| Ubuntu with GPUs | Yes | 64 GB | :ref:`install-on-ubuntu-with-gpus` |\n+-+-+-++\n| Ubuntu with CPUs | No | 64 GB | :ref:`install-on-ubuntu-cpus-only` |\n+-+-+-++\n| RHEL with GPUs | Yes | 64 GB | :ref:`install-on-rhel-with-gpus` |\n+-+-+-++\n| RHEL with CPUs | No | 64 GB | :ref:`install-on-rhel-cpus-only` |\n+-+-+-++\n| IBM Power (Minsky) | Yes | 64 GB | Contact sales@h2o.ai |\n+-+-+-++\n\n\nDesktop\n~\n\n+-+-+-+-++\n| Operating System | GPU Support?" }, { "output": " JDBC Setup\n\n\nDriverless AI lets you explore Java Database Connectivity (JDBC) data sources from within the Driverless AI application." }, { "output": " Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Tested Databases\n\n\nThe following databases have been tested for minimal functionality. Note that JDBC drivers that are not included in this list should work with Driverless AI." }, { "output": " See the :ref:`untested-jdbc-driver` section at the end of this chapter for information on how to try out an untested JDBC driver." }, { "output": " This is a JSON/Dictionary String with multiple keys. Note: This requires a JSON key (typically the name of the database being configured) to be associated with a nested JSON that contains the ``url``, ``jarpath``, and ``classpath`` fields." }, { "output": " Double quotation marks (``\"...\"``) must be used to denote keys and values *within* the JSON dictionary, and *outer* quotations must be formatted as either ``\"\"\"``, ``'``, or ``'``." }, { "output": " The following examples show two unique methods for applying outer quotations. 
}, { "output": " The following examples show two unique methods for applying outer quotations. - Configuration value applied with the config.toml file:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}\"\"\"\n\n - Configuration value applied with an environment variable:\n \n ::\n \n DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}'\n \n For example:\n \n ::\n \n DRIVERLESS_AI_JDBC_APP_CONFIGS='{\n \"postgres\": {\"url\": \"jdbc:postgresql://192.xxx.x.xxx:aaaa/name_of_database;user=name_of_user;password=your_password\",\"jarpath\": \"/config/postgresql-xx.x.x.jar\",\"classpath\": \"org.postgresql.Driver\"}, \n \"postgres-local\": {\"url\": \"jdbc:postgresql://123.xxx.xxx.xxx:aaaa/name_of_database\",\"jarpath\": \"/config/postgresql-xx.x.x.jar\",\"classpath\": \"org.postgresql.Driver\"},\n \"ms-sql\": {\"url\": \"jdbc:sqlserver://192.xxx.x.xxx:aaaa;databaseName=name_of_database;user=name_of_user;password=your_password\",\"username\":\"your_username\",\"password\":\"your_password\",\"jarpath\": \"/config/sqljdbc42.jar\",\"classpath\": \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"},\n \"oracle\": {\"url\": \"jdbc:oracle:thin:@192.xxx.x.xxx:aaaa/orclpdb1\",\"jarpath\": \"ojdbc7.jar\",\"classpath\": \"oracle.jdbc.OracleDriver\"},\n \"db2\": {\"url\": \"jdbc:db2://127.x.x.x:aaaaa/name_of_database\",\"jarpath\": \"db2jcc4.jar\",\"classpath\": \"com.ibm.db2.jcc.DB2Driver\"},\n \"mysql\": {\"url\": \"jdbc:mysql://192.xxx.x.xxx:aaaa\",\"jarpath\": \"mysql-connector.jar\",\"classpath\": \"com.mysql.jdbc.Driver\"},\n \"Snowflake\": {\"url\": \"jdbc:snowflake://.snowflakecomputing.com/?\",\"jarpath\": \"/config/snowflake-jdbc-x.x.x.jar\",\"classpath\": \"net.snowflake.client.jdbc.SnowflakeDriver\"},\n \"Derby\": {\"url\": \"jdbc:derby://127.x.x.x:aaaa/name_of_database\",\"jarpath\": \"/config/derbyclient.jar\",\"classpath\": \"org.apache.derby.jdbc.ClientDriver\"}\n }'\n\n- ``jdbc_app_jvm_args``: Extra JVM args for the JDBC connector." }, { "output": " - ``jdbc_app_classpath``: Optionally specify an alternative classpath for the JDBC connector. - ``enabled_file_systems``: The file systems you want to enable." }, { "output": " Retrieve the JDBC Driver\n\n\n1. Download JDBC Driver JAR files:\n\n - `Oracle DB `_\n\n - `PostgreSQL `_\n\n - `Amazon Redshift `_\n\n - `Teradata `_\n\n Note: Remember to take note of the driver classpath, as it is needed for the configuration steps (for example, org.postgresql.Driver)." }, { "output": " 2. Copy the driver JAR to a location that can be mounted into the Docker container. Note: The folder storing the JDBC jar file must be visible/readable by the dai process user." }, { "output": " 3. Start the Driverless AI Docker image using the JDBC-specific environment variables. Note that the JDBC connection strings will vary depending on the database that is used. .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs,jdbc\" \\\n -e DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"postgres\": \n {\"url\": \"jdbc:postgresql://localhost:5432/my_database\", \n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\", \n \"classpath\": \"org.postgresql.Driver\"}}' \\ \n -e DRIVERLESS_AI_JDBC_APP_JVM_ARGS=\"-Xmx2g\" \\\n -p 12345:12345 \\\n -v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure JDBC options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n .. code-block:: bash \n\n enabled_file_systems = \"file, upload, jdbc\"\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgresql://localhost:5432/my_database\",\n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the JDBC connector for PostgreSQL." }, { "output": " - The configuration requires a JSON key (typically the name of the database being configured) to be associated with a nested JSON that contains the ``url``, ``jarpath``, and ``classpath`` fields." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." 
}, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"upload, file, hdfs, jdbc\"\n\n # Configuration for JDBC Connector." }, { "output": " # JSON/Dictionary String with multiple keys. # Format as a single line without using carriage returns (the following example is formatted for readability)." }, { "output": " # Example:\n # \"\"\"{\n # \"postgres\": {\n # \"url\": \"jdbc:postgresql://ip address:port/postgres\",\n # \"jarpath\": \"/path/to/postgres_driver.jar\",\n # \"classpath\": \"org.postgresql.Driver\"\n # },\n # \"mysql\": {\n # \"url\":\"mysql connection string\",\n # \"jarpath\": \"/path/to/mysql_driver.jar\",\n # \"classpath\": \"my.sql.classpath.Driver\"\n # }\n # }\"\"\"\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgresql://localhost:5432/my_database\",\n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n # extra jvm args for jdbc connector\n jdbc_app_jvm_args = \"\"\n\n # alternative classpath for jdbc connector\n jdbc_app_classpath = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " Adding Datasets Using JDBC\n\n\nAfter the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " 1. Click on the Add Dataset button on the Datasets page. 2. Select JDBC from the list that appears. 3. Click the Select JDBC Connection button to select a JDBC configuration." }, { "output": " 4. The form will populate with the JDBC Database, URL, Driver, and Jar information. Complete the following remaining fields:\n\n - JDBC Username: Enter your JDBC username." }, { "output": " - JDBC Password: Enter your JDBC password. (See the *Notes* section)\n\n - Destination Name: Enter a name for the new dataset. - (Optional) ID Column Name: Enter a name for the ID column." }, { "output": " Notes:\n\n - Do not include the password as part of the JDBC URL. Instead, enter the password in the JDBC Password field." }, { "output": " - Due to resource sharing within Driverless AI, the JDBC Connector is only allocated a relatively small amount of memory." }, { "output": " - When making large queries, specify an ID column so that the data can be fetched in manageable portions. This ensures that the maximum memory allocation is not exceeded. - If a query that is larger than the maximum memory allocation is made without specifying an ID column, the query will not complete successfully." }, { "output": " 5. Write a SQL Query in the format of the database that you want to query. (See the `Query Examples <#queryexamples>`__ section below.)" }, { "output": " 6. 
Click the Click to Make Query button to execute the query. The time it takes to complete depends on the size of the data being queried and the network speeds to the database." }, { "output": " .. _queryexamples:\n\nQuery Examples\n\n\nThe following are sample configurations and queries for Oracle DB and PostgreSQL:\n\n.. tabs:: \n .. group-tab:: Oracle DB\n\n 1. Configuration: Add an ``oracledb`` entry to ``jdbc_app_configs`` (see the PostgreSQL tab for the format). 2." }, { "output": " Sample Query:\n\n - Select oracledb from the Select JDBC Connection dropdown menu. - JDBC Username: ``oracleuser``\n - JDBC Password: ``oracleuserpassword``\n - ID Column Name:\n - Query:\n\n ::\n\n SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION\n\n Note: Because this query does not specify an ID Column Name, it will only work for small data." }, { "output": " 3. Click the Click to Make Query button to execute the query. .. group-tab:: PostgreSQL \n\n 1. Configuration:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgresql://localhost:5432/postgresdatabase\", \"jarpath\": \"/home/ubuntu/postgres-artifacts/postgres/Driver.jar\", \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n 2. Sample Query:" }, { "output": " - JDBC Username: ``postgres_user``\n - JDBC Password: ``pguserpassword``\n - ID Column Name: ``id``\n - Query:\n\n ::\n\n SELECT * FROM loan_level WHERE LOAN_TYPE = 5\n\n This selects all columns from the loan_level table where the LOAN_TYPE column contains the value 5.\n\n 3. Click the Click to Make Query button to execute the query." }, { "output": " .. _untested-jdbc-driver:\n\nAdding an Untested JDBC Driver\n\n\nWe encourage you to try out JDBC drivers that are not tested in house." }, { "output": " 1. Download the JDBC jar for your database. 2. Move your JDBC jar file to a location that DAI can access." }, { "output": " 3. Start the Driverless AI Docker image using the JDBC-specific environment variables. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"upload,file,hdfs,s3,recipe_file,jdbc\" \\\n -e DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}' \\ \n -e DRIVERLESS_AI_JDBC_APP_JVM_ARGS=\"-Xmx2g\" \\\n -p 12345:12345 \\\n -v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n 1. Download the JDBC jar for your database." }, { "output": " 2. Move your JDBC jar file to a location that DAI can access. 3. Configure the Driverless AI config.toml file." }, { "output": " 4. Mount the config.toml file and requisite JAR files into the Docker container. .. 
code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1. Download the JDBC jar for your database." }, { "output": " 2. Move your JDBC jar file to a location that DAI can access. 3. Modify the following config.toml settings." }, { "output": " ::\n\n # JSON/Dictionary String with multiple keys. # Format as a single line without using carriage returns (the following example is formatted for readability)." }, { "output": " # Example:\n jdbc_app_configs = \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n\n # optional extra jvm args for jdbc connector\n jdbc_app_jvm_args = \"\"\n\n # optional alternative classpath for jdbc connector\n jdbc_app_classpath = \"\"\n\n 4. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " MinIO Setup\n-\n\nThis section provides instructions for configuring Driverless AI to work with `MinIO `__." }, { "output": " Note: Depending on your Docker install version, use either the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``minio_endpoint_url``: The endpoint URL that will be used to access MinIO." }, { "output": " - ``minio_access_key_id``: The MinIO access key ID. - ``minio_secret_access_key``: The MinIO secret access key. - ``minio_skip_cert_verification``: If this is set to true, then the MinIO connector will skip certificate verification." }, { "output": " - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly." }, { "output": " This example enables the MinIO data connector with authentication by passing an endpoint URL, access key ID, and an access key. It also configures Docker DNS by passing the name and IP of the name node. This lets you reference data stored in MinIO directly using the endpoint URL, for example: http://<endpoint_url>/datasets/iris.csv." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, minio\"``\n - ``minio_endpoint_url = \"\"``\n - ``minio_access_key_id = \"\"``\n - ``minio_secret_access_key = \"\"``\n - ``minio_skip_cert_verification = \"false\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Native Installs\n\n This example enables the MinIO data connector with authentication by passing an endpoint URL, access key ID, and an access key." 
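}, { "output": " Before restarting Driverless AI, it can also help to confirm that the MinIO endpoint is reachable from the host. A quick sketch using MinIO's standard liveness probe (the host and port below are placeholders):\n\n ::\n\n curl -i http://minio.example.com:9000/minio/health/live" 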
}, { "output": " Enabling the connector in this way allows users to reference data stored in MinIO directly using the endpoint URL, for example: http://<endpoint_url>/datasets/iris.csv." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : MinIO Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, minio\"\n\n # MinIO Connector credentials\n minio_endpoint_url = \"\"\n minio_access_key_id = \"\"\n minio_secret_access_key = \"\"\n minio_skip_cert_verification = \"false\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " .. _install-on-azure:\n\nInstall on Azure\n\n\nThis section describes how to install the Driverless AI image from Azure." }, { "output": " This is no longer the case as of version 1.5.2. Watch the installation video `here `__." }, { "output": " Environment\n~\n\n+----------+---------------+----------+-----------------+\n| Provider | Instance Type | Num GPUs | Suitable for    |\n+==========+===============+==========+=================+\n| Azure    | Standard_NV6  | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | Standard_NV12 | 2        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | Standard_NV24 | 4        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | Standard_NC6  | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | Standard_NC12 | 2        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | Standard_NC24 | 4        | Serious use     |\n+----------+---------------+----------+-----------------+\n\nAbout the Install\n~\n\n.. include:: linux-rpmdeb-about.frag\n\nInstalling the Azure Instance\n~\n\n1." }, { "output": " 2. Search for and select H2O DriverlessAI in the Marketplace. .. image:: ../images/azure_select_driverless_ai.png\n :align: center\n\n3." }, { "output": " This launches the H2O DriverlessAI Virtual Machine creation process. .. image:: ../images/azure_search_for_dai.png\n :align: center\n\n4." }, { "output": " a. Enter a name for the VM. b. Select the Disk Type for the VM. Use HDD for GPU instances. c. Enter the name that you will use when connecting to the machine through SSH." }, { "output": " e. Specify the Subscription option. (This should be Pay-As-You-Go.) f. Enter a unique name for the resource group." }, { "output": " Click OK when you are done. .. image:: ../images/azure_basics_tab.png\n :align: center\n\n5. On the Size tab, select your virtual machine size." }, { "output": " We recommend using an N-Series type, which comes with a GPU. 
Also note that Driverless AI requires 10 GB of free space in order to run and will stop working if less than 10 GB is available." }, { "output": " Click OK when you are done. .. image:: ../images/azure_vm_size.png\n :align: center\n\n6. On the Settings tab, select or create the Virtual Network and Subnet where the VM is going to be located and then click OK.\n\n .. image:: ../images/azure_settings_tab.png\n :align: center\n\n7." }, { "output": " When the validation passes successfully, click Create to create the VM. .. image:: ../images/azure_summary_tab.png\n :align: center\n\n8." }, { "output": " Select this Driverless AI VM to view the IP address of your newly created machine. 9. Connect to Driverless AI with your browser using the IP address retrieved in the previous step." }, { "output": " To stop the instance: \n\n1. Click the Virtual Machines left menu item. 2. Select the checkbox beside your DriverlessAI virtual machine." }, { "output": " 3. On the right side of the row, click the ... button, then select Stop. (Note that you can then restart this by selecting Start.)" }, { "output": " \nUpgrading the Driverless AI Community Image\n~\n\n.. include:: upgrade-warning.frag\n\nUpgrading from Version 1.2.2 or Earlier\n'\n\nThe following example shows how to upgrade from 1.2.2 or earlier to the current version." }, { "output": " 1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:\n\n .. code-block:: bash\n\n # Set up a directory of the previous version name\n mkdir dai_rel_1.2.2\n\n # Copy the data, log, license, and tmp directories as backup\n cp -a ./data dai_rel_1.2.2/data\n cp -a ./log dai_rel_1.2.2/log\n cp -a ./license dai_rel_1.2.2/license\n cp -a ./tmp dai_rel_1.2.2/tmp\n\n2." }, { "output": " The command below retrieves version 1.2.2:\n\n .. code-block:: bash\n\n wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.2.2-6/x86_64-centos7/dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz\n\n3." }, { "output": " 4. Use the ``docker load`` command to load the image:\n\n .. code-block:: bash\n\n docker load < dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz\n\n5." }, { "output": " 6. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345. Upgrading from Version 1.3.0 or Later\n\n\nThe following example shows how to upgrade from version 1.3.0." }, { "output": " 1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:\n\n .. code-block:: bash\n\n # Set up a directory of the previous version name\n mkdir dai_rel_1.3.0\n\n # Copy the data, log, license, and tmp directories as backup\n cp -a ./data dai_rel_1.3.0/data\n cp -a ./log dai_rel_1.3.0/log\n cp -a ./license dai_rel_1.3.0/license\n cp -a ./tmp dai_rel_1.3.0/tmp\n\n2." }, { "output": " Replace VERSION and BUILD below with the Driverless AI version. .. code-block:: bash\n\n wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/VERSION-BUILD/x86_64/dai-ubi8-centos7-x86_64-VERSION.tar.gz\n\n3." }, { "output": " 4. In the new AMI, locate the DAI_RELEASE file, and edit that file to match the new image tag. 5. Stop and then start Driverless AI." }, { "output": " .. _gbq:\n\nGoogle BigQuery Setup\n#####################\n\nDriverless AI lets you explore Google BigQuery (GBQ) data sources from within the Driverless AI application." }, { "output": " .. note::\n\tThe setup described on this page requires you to enable authentication. 
Enabling the GCS and/or GBQ connectors causes those file systems to be displayed in the UI, but the GCS and GBQ connectors cannot be used without first enabling authentication." }, { "output": " 1. In the Google Cloud Platform (GCP), create a private key for your service account. To create a private key, click Service Accounts > Keys, and then click the Add Key button." }, { "output": " To finish creating the JSON private key and download it to your local file system, click Create. 2. Mount the downloaded JSON file into the Docker container (the examples below refer to it as ``auth-key.json``)." }, { "output": " 3. Specify the path to the downloaded and mounted ``auth-key.json`` file with the ``gcs_path_to_service_account_json`` config option." }, { "output": " Use ``docker version`` to check which version of Docker you are using. The following sections describe how to enable the GBQ data connector:\n\n- :ref:`gbq-config-toml`\n- :ref:`gbq-environment-variable`\n- :ref:`gbq-workload-identity`\n\n.. _gbq-config-toml:\n\nEnabling GBQ with the config.toml file\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the GBQ data connector with authentication by passing the JSON authentication file." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,gbq\" \\\n -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON=\"/service_account_json.json\" \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n -v `pwd`/service_account_json.json:/service_account_json.json \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure the GBQ data connector options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, gbq\"``\n - ``gcs_path_to_service_account_json = \"/service_account_json.json\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the GBQ data connector with authentication by passing the JSON authentication file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." }, { "output": " ::\n\n # File System Support\n # file : local file system/server file system\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n enabled_file_systems = \"file, gbq\"\n\n # GCS Connector credentials\n # example (suggested) \"/licenses/my_service_account_json.json\"\n gcs_path_to_service_account_json = \"/service_account_json.json\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." 
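}, { "output": " If you still need to create the service account key referenced in these steps, one way to generate it is with the ``gcloud`` CLI. A minimal sketch (the key path, service account name, and project are placeholders):\n\n ::\n\n gcloud iam service-accounts keys create /path/to/service_account_json.json \\\n --iam-account=my-service-account@my-project.iam.gserviceaccount.com" 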
}, { "output": " .. _gbq-environment-variable:\n\nEnabling GBQ by setting an environment variable\n*\n\nThe GBQ data connector can be configured by setting the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as follows:\n\n::\n\n export GOOGLE_APPLICATION_CREDENTIALS=\"SERVICE_ACCOUNT_KEY_PATH\"\n\nIn the preceding example, replace ``SERVICE_ACCOUNT_KEY_PATH`` with the path of the JSON file that contains your service account key." }, { "output": " .. _gbq-workload-identity:\n\nEnabling GBQ by enabling Workload Identity for your GKE cluster\n*\n\nThe GBQ data connector can be configured by enabling Workload Identity for your Google Kubernetes Engine (GKE) cluster." }, { "output": " .. note::\n\tIf Workload Identity is enabled, then the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable does not need to be set." }, { "output": " .. note::\n\tTo run a BigQuery query with Driverless AI, the associated service account must have the following Identity and Access Management (IAM) permissions:\n\n ::\n\n bigquery.jobs.create\n bigquery.tables.create\n bigquery.tables.delete\n bigquery.tables.export\n bigquery.tables.get\n bigquery.tables.getData\n bigquery.tables.list\n bigquery.tables.update\n bigquery.tables.updateData\n storage.buckets.get\n storage.objects.create\n storage.objects.delete\n storage.objects.list\n storage.objects.update\n\n For a list of all Identity and Access Management permissions, refer to the `IAM permissions reference `_ from the official Google Cloud documentation." }, { "output": " 1. Enter BQ Dataset ID with write access to create temporary table: Enter a dataset ID in Google BigQuery that this user has read/write access to." }, { "output": " Note: Driverless AI's connection to GBQ will inherit the top-level directory from the service JSON file." }, { "output": " 2. Enter Google Storage destination bucket: Specify the name of the Google Cloud Storage destination bucket." }, { "output": " 3. Enter Name for Dataset to be saved as: Specify a name for the dataset, for example, ``my_file``." }, { "output": " 4. Enter BigQuery Query (Use StandardSQL): Enter a StandardSQL query that you want BigQuery to execute." }, { "output": " 5. (Optional) Specify a project to use with the GBQ connector. This is equivalent to providing ``project`` when using a command-line interface." }, { "output": " Linux Docker Images\n-\n\nTo simplify local installation, Driverless AI is provided as a Docker image for the following system combinations:\n\n+-----------------------------+----------------+-------------------+---------+\n| Host OS                     | Docker Version | Host Architecture | Min Mem |\n+=============================+================+===================+=========+\n| Ubuntu 16.04 or later       | Docker CE      | x86_64            | 64 GB   |\n+-----------------------------+----------------+-------------------+---------+\n| RHEL or CentOS 7.4 or later | Docker CE      | x86_64            | 64 GB   |\n+-----------------------------+----------------+-------------------+---------+\n| NVIDIA DGX Registry         |                | x86_64            |         |\n+-----------------------------+----------------+-------------------+---------+\n\nNote: CUDA 11.2.2 or later with NVIDIA drivers >= |NVIDIA-driver-ver| is recommended (GPU only)." }, { "output": " For the best performance, including GPU support, use nvidia-docker. For a lower-performance experience without GPUs, use regular docker (with the same docker image)." }, { "output": " For information on how to obtain a license key for Driverless AI, visit https://h2o.ai/o/try-driverless-ai/." }, { "output": " Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker; if both are enabled in the launch command, ``tini`` prints a (harmless) warning message." }, { "output": " We recommend ``--shm-size=256m`` in the Driverless AI Docker launch command; if you plan to build :ref:`image auto model ` models extensively, ``--shm-size=2g`` is recommended instead." 
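}, { "output": " Putting these recommendations together, a minimal CPU-only launch might look like the following sketch (the volume mappings are illustrative and should match your own directories):\n\n .. code-block:: bash\n :substitutions:\n\n docker run \\\n --rm \\\n --pid=host \\\n --init \\\n --shm-size=256m \\\n -u $(id -u):$(id -g) \\\n -p 12345:12345 \\\n -v /tmp/ddata/:/data \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n h2oai/dai-ubi8-x86_64:|tag|" 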
}, { "output": " \nThis section provides instructions for upgrading Driverless AI versions that were installed in a Docker container." }, { "output": " WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded when Driverless AI is upgraded." }, { "output": " - Build MOJO pipelines before upgrading. - Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading." }, { "output": " Before upgrading, be sure to run MLI jobs on models that you want to continue to interpret in future releases." }, { "output": " If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be able to build a MOJO pipeline on that model after upgrading." }, { "output": " Note: Stop Driverless AI if it is still running. Requirements\n\n\nWe recommend having NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers." }, { "output": " .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\n1." }, { "output": " 2. Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n # cd into the new directory\n cd |VERSION-dir|\n\n3." }, { "output": " 4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " Install the Driverless AI AWS Marketplace AMI\n-\n\nA Driverless AI AMI is available in the AWS Marketplace beginning with Driverless AI version 1.5.2." }, { "output": " Environment\n~\n\n+----------+---------------+----------+-----------------+\n| Provider | Instance Type | Num GPUs | Suitable for    |\n+==========+===============+==========+=================+\n| AWS      | p2.xlarge     | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | p2.8xlarge    | 8        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | p2.16xlarge   | 16       | Serious use     |\n|          +---------------+----------+-----------------+\n|          | p3.2xlarge    | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | p3.8xlarge    | 4        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | p3.16xlarge   | 8        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | g3.4xlarge    | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | g3.8xlarge    | 2        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | g3.16xlarge   | 4        | Serious use     |\n+----------+---------------+----------+-----------------+\n\nInstallation Procedure\n\n\n1." }, { "output": " 2. Search for Driverless AI. .. figure:: ../images/aws-marketplace-search.png\n :alt: Search for Driverless AI\n\n3." }, { "output": " .. figure:: ../images/aws-marketplace-versions.png\n :alt: Select version\n\n4. Scroll down to review/edit your region and the selected infrastructure and pricing." }, { "output": " 5. Return to the top and select Continue to Subscribe. .. figure:: ../images/aws-marketplace-continue-to-subscribe.png\n :alt: Continue to subscribe\n\n6. Review the subscription, then click Continue to Configure." }, { "output": " 7. If desired, change the Fulfillment Option, Software Version, and Region. Note that this page also includes the AMI ID for the selected software version." }, { "output": " .. figure:: ../images/aws-marketplace-configure-software.png\n :alt: Configure the software\n\n8. Review the configuration and choose a method for launching Driverless AI." 
}, { "output": " Scroll down to the bottom of the page and click Launch when you are done. .. figure:: ../images/aws-marketplace-launch.png\n :alt: Launch options\n\nYou will receive a \"Success\" message when the image launches successfully." }, { "output": " 1. Navigate to the `EC2 Console `__. 2. Select your instance. 3. Open another browser and launch Driverless AI by navigating to https://<public-dns>:12345." }, { "output": " Sign in to Driverless AI with the username h2oai and use the AWS InstanceID as the password. You will be prompted to enter your Driverless AI license key when you log in for the first time." }, { "output": " To stop the instance: \n\n1. On the EC2 Dashboard, click the Running Instances link under the Resources section." }, { "output": " 2. Select the instance that you want to stop. 3. In the Actions drop-down menu, select Instance State > Stop." }, { "output": " 4. A confirmation page will display. Click Yes, Stop to stop the instance. Upgrading the Driverless AI Marketplace Image\n\n\nNote that the first offering of the Driverless AI Marketplace image was 1.5.2." }, { "output": " Perform the following steps if you are upgrading to a Driverless AI Marketplace image version greater than 1.5.2." }, { "output": " Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf." }, { "output": " .. code-block:: bash\n\n # Stop Driverless AI.\n sudo systemctl stop dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " .. _install-on-google-compute:\n\nInstall on Google Compute\n-\n\nDriverless AI can be installed on Google Compute using one of two methods:\n\n- Install the Google Cloud Platform offering." }, { "output": " - Install and Run in a Docker Container on Google Compute Engine. This installs and runs Driverless AI from scratch in a Docker container on Google Compute Engine." }, { "output": " kdb+ Setup\n\n\nDriverless AI lets you explore `kdb+ `__ data sources from within the Driverless AI application." }, { "output": " Note: Depending on your Docker install version, use either the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``kdb_user``: (Optional) User name \n- ``kdb_password``: (Optional) User's password\n- ``kdb_hostname``: IP address or host of the kdb+ server\n- ``kdb_port``: Port on which the kdb+ server is listening\n- ``kdb_app_jvm_args``: (Optional) JVM args for kdb+ distributions (for example, ``-Dlog4j.configuration``)." }, { "output": " - ``kdb_app_classpath``: (Optional) The kdb+ classpath (or other if the jar file is stored elsewhere)." }, { "output": " - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Example 1: Enable kdb+ with No Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the kdb+ connector without authentication." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,kdb\" \\\n -e DRIVERLESS_AI_KDB_HOSTNAME=\"\" \\\n -e DRIVERLESS_AI_KDB_PORT=\"\" \\\n -p 12345:12345 \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. 
group-tab:: Docker Image with the config.toml\n\n This example shows how to configure kdb+ options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, kdb\"``\n - ``kdb_hostname = \"\"``\n - ``kdb_port = \"\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the kdb+ connector without authentication." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, kdb\"\n\n # KDB Connector credentials\n kdb_hostname = \"\"\n kdb_port = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " Example 2: Enable kdb+ with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example provides user credentials for accessing a kdb+ server from Driverless AI." }, { "output": " Note that this example enables kdb+ with authentication. 1. Configure the Driverless AI config.toml file." }, { "output": " - ``enabled_file_systems = \"file, upload, kdb\"``\n - ``kdb_user = \"\"``\n - ``kdb_password = \"\"``\n - ``kdb_hostname = \"\"``\n - ``kdb_port = \"\"``\n - ``kdb_app_classpath = \"\"``\n - ``kdb_app_jvm_args = \"\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example provides user credentials for accessing a kdb+ server from Driverless AI." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, kdb\"\n\n # kdb+ Connector credentials\n kdb_user = \"\"\n kdb_password = \"\"\n kdb_hostname = \"\"\n kdb_port = \"\"\n kdb_app_classpath = \"\"\n kdb_app_jvm_args = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " Adding Datasets Using kdb+\n\n\nAfter the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " 1. Enter filepath to save query: Enter the local file path for storing your dataset. For example, /home/<user>/myfile.csv." }, { "output": " 2. Enter KDB Query: Enter a kdb+ query that you want to execute. Note that the connector will accept any `q queries `__." }, { "output": " Data Recipe File Setup\n\n\nDriverless AI lets you explore data recipe file data sources from within the Driverless AI application." }, { "output": " When enabled (default), you will be able to modify datasets that have been added to Driverless AI. (Refer to :ref:`modify_by_recipe` for more information.)" }, { "output": " These steps are provided in case this connector was previously disabled and you want to re-enable it." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Enable Data Recipe File\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the data recipe file data connector." 
}, { "output": " Note that ``recipe_file`` is enabled in the config.toml file by default. 1. Configure the Driverless AI config.toml file." }, { "output": " - ``enabled_file_systems = \"file, upload, recipe_file\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " .. group-tab:: Native Installs\n\n This example enables the data recipe file data connector. Note that ``recipe_file`` is enabled by default. 1. Export the Driverless AI config.toml file or add it to ~/.bashrc." }, { "output": " 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, recipe_file\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " Data Recipe URL Setup\n-\n\nDriverless AI lets you explore data recipe URL data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with data recipe URLs." }, { "output": " When enabled (default), you will be able to modify datasets that have been added to Driverless AI. (Refer to :ref:`modify_by_recipe` for more information.) Notes:\n\n- This connector is enabled by default. These steps are provided in case this connector was previously disabled and you want to re-enable it." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Enable Data Recipe URL\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the data recipe URL data connector." }, { "output": " Note that ``recipe_url`` is enabled in the config.toml file by default. 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, recipe_url\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the Data Recipe URL data connector." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below.\n # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, recipe_url\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " AutoDoc Settings\n\n\nThis section includes settings that can be used to configure AutoDoc. ``make_autoreport``\n~\n\n.. dropdown:: Make AutoDoc\n\t:open:\n\n\tSpecify whether to create an AutoDoc for the experiment after it has finished running." }, { "output": " ``autodoc_report_name``\n~\n\n.. dropdown:: AutoDoc Name\n\t:open:\n\n\tSpecify a name for the AutoDoc report. This is set to \"report\" by default. ``autodoc_template``\n\n\n.. dropdown:: AutoDoc Template Location\n\t:open:\n\n\tSpecify a path for the AutoDoc template:\n\n\t- To generate a custom AutoDoc template, specify the full path to your custom template." }, { "output": " ``autodoc_output_type``\n~\n\n.. dropdown:: AutoDoc File Output Type\n\t:open:\n\n\tSpecify the AutoDoc output type. Choose from the following file types:\n\n\t- docx (Default)\n\t- md\n\n``autodoc_subtemplate_type``\n\n\n.. dropdown:: AutoDoc SubTemplate Type\n\t:open:\n\n\tSpecify the type of sub-templates to use." }, { "output": " This value defaults to 10. ``autodoc_num_features``\n\n\n.. dropdown:: Number of Top Features to Document\n\t:open:\n\n\tSpecify the number of top features to display in the document. To disable this setting, specify -1." }, { "output": " ``autodoc_min_relative_importance``\n~\n\n.. dropdown:: Minimum Relative Feature Importance Threshold\n\t:open:\n\n\tSpecify the minimum relative feature importance in order for a feature to be displayed. This value must be a float >= 0 and <= 1." 
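}, { "output": " As with the connector settings earlier in this document, AutoDoc options can also be supplied as ``DRIVERLESS_AI_``-prefixed environment variables or in config.toml. A minimal sketch (the values below are illustrative, not defaults):\n\n ::\n\n export DRIVERLESS_AI_MAKE_AUTOREPORT=\"true\"\n export DRIVERLESS_AI_AUTODOC_REPORT_NAME=\"report\"\n export DRIVERLESS_AI_AUTODOC_NUM_FEATURES=\"20\"\n export DRIVERLESS_AI_AUTODOC_MIN_RELATIVE_IMPORTANCE=\"0.003\"" 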
}, { "output": " ``autodoc_include_permutation_feature_importance``\n\n\n.. dropdown:: Permutation Feature Importance\n\t:open:\n\n\tSpecify whether to compute permutation-based feature importance. This is disabled by default." }, { "output": " This is set to 1 by default. ``autodoc_feature_importance_scorer``\n~\n\n.. dropdown:: Feature Importance Scorer\n\t:open:\n\n\tSpecify the name of the scorer to be used when calculating feature importance. Leave this setting unspecified to use the default scorer for the experiment." }, { "output": " ``autodoc_pd_max_runtime``\n\n\n.. dropdown:: PDP Max Runtime in Seconds\n\t:open:\n\n\tSpecify the maximum number of seconds Partial Dependency computation can take when generating a report. Set this value to -1 to disable the time limit." }, { "output": " ``autodoc_out_of_range``\n\n\n.. dropdown:: PDP Out of Range\n\t:open:\n\n\tSpecify the number of standard deviations outside of the range of a column to include in partial dependence plots. This shows how the model reacts to data it has not seen before." }, { "output": " ``autodoc_num_rows``\n\n\n.. dropdown:: ICE Number of Rows\n\t:open:\n\n\tSpecify the number of rows to include in PDP and ICE plots if individual rows are not specified. This is set to 0 by default. ``autodoc_population_stability_index``\n\n\n.. dropdown:: Population Stability Index\n\t:open:\n\n\tSpecify whether to include a population stability index if the experiment is a binary classification or regression problem." }, { "output": " ``autodoc_population_stability_index_n_quantiles``\n\n\n.. dropdown:: Population Stability Index Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for the population stability index. This is set to 10 by default." }, { "output": " This value is disabled by default. ``autodoc_prediction_stats_n_quantiles``\n\n\n.. dropdown:: Prediction Statistics Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for prediction statistics." }, { "output": " ``autodoc_response_rate``\n~\n\n.. dropdown:: Response Rates Plot\n\t:open:\n\n\tSpecify whether to include response rates information if the experiment is a binary classification problem. This is disabled by default." }, { "output": " This is set to 10 by default. ``autodoc_gini_plot``\n~\n\n.. dropdown:: Show GINI Plot\n\t:open:\n\n\tSpecify whether to show the GINI plot. This is disabled by default. ``autodoc_enable_shapley_values``\n~\n\n.. dropdown:: Enable Shapley Values\n\t:open:\n\n\tSpecify whether to show Shapley values results in the AutoDoc." }, { "output": " ``autodoc_data_summary_col_num``\n\n\n.. dropdown:: Number of Features in Data Summary Table\n\t:open:\n\n\tSpecify the number of features to be shown in the data summary table. This value must be an integer." }, { "output": " This is set to -1 by default. ``autodoc_list_all_config_settings``\n\n\n.. dropdown:: List All Config Settings\n\t:open:\n\n\tSpecify whether to show all config settings. If this is disabled, only settings that have been changed are listed." }, { "output": " This is disabled by default. ``autodoc_keras_summary_line_length``\n~\n\n.. dropdown:: Keras Model Architecture Summary Line Length\n\t:open:\n\n\tSpecify the line length of the Keras model architecture summary." }, { "output": " To use the default line length, set this value to -1 (default). ``autodoc_transformer_architecture_max_lines``\n\n\n.. 
dropdown:: NLP/Image Transformer Architecture Max Lines\n\t:open:\n\n\tSpecify the maximum number of lines shown for advanced transformer architecture in the Feature section." }, { "output": " ``autodoc_full_architecture_in_appendix``\n~\n\n.. dropdown:: Appendix NLP/Image Transformer Architecture\n\t:open:\n\n\tSpecify whether to show the full NLP/Image transformer architecture in the appendix. This is disabled by default." }, { "output": " This is disabled by default. ``autodoc_coef_table_num_models``\n~\n\n.. dropdown:: GLM Coefficient Tables Number of Models\n\t:open:\n\n\tSpecify the number of models for which a GLM coefficients table is shown in the AutoDoc." }, { "output": " Set this value to -1 to show tables for all models. This is set to 1 by default. ``autodoc_coef_table_num_folds``\n\n\n.. dropdown:: GLM Coefficient Tables Number of Folds Per Model\n\t:open:\n\n\tSpecify the number of folds per model for which a GLM coefficients table is shown in the AutoDoc." }, { "output": " ``autodoc_coef_table_num_coef``\n~\n\n.. dropdown:: GLM Coefficient Tables Number of Coefficients\n\t:open:\n\n\tSpecify the number of coefficients to show within a GLM coefficients table in the AutoDoc. This is set to 50 by default." }, { "output": " ``autodoc_coef_table_num_classes``\n\n\n.. dropdown:: GLM Coefficient Tables Number of Classes\n\t:open:\n\n\tSpecify the number of classes to show within a GLM coefficients table in the AutoDoc. Set this value to -1 to show all classes." }, { "output": " Snowflake Setup\n- \n\nDriverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Snowflake." }, { "output": " If you enable Snowflake connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``snowflake_account``: The Snowflake account ID\n- ``snowflake_user``: The username for accessing the Snowflake account\n- ``snowflake_password``: The password for accessing the Snowflake account\n- ``enabled_file_systems``: The file systems you want to enable." }, { "output": " Enable Snowflake with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the Snowflake data connector with authentication by passing the ``account``, ``user``, and ``password`` variables." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, snow\"``\n - ``snowflake_account = \"\"``\n - ``snowflake_user = \"\"``\n - ``snowflake_password = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. 
group-tab:: Native Installs\n\n This example enables the Snowflake data connector with authentication by passing the ``account``, ``user``, and ``password`` variables." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " \n # hive : Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = "file, snow"\n\n # Snowflake Connector credentials\n snowflake_account = ""\n snowflake_user = ""\n snowflake_password = ""\n\n 3." }, { "output": " Adding Datasets Using Snowflake\n \n\nAfter the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or Drag and Drop) drop-down menu. .. figure:: ../images/add_dataset_dropdown.png\n :alt: Add Dataset\n :height: 338\n :width: 237\n\nSpecify the following information to add your dataset." }, { "output": " 1. Enter Database: Specify the name of the Snowflake database that you are querying. 2. Enter Warehouse: Specify the name of the Snowflake warehouse that you are querying. 3. Enter Schema: Specify the schema of the dataset that you are querying." }, { "output": " 4. Enter Name for Dataset to Be Saved As: Specify a name for the dataset to be saved as. Note that this can only be a CSV file (for example, myfile.csv). 5. Enter Username: (Optional) Specify the username associated with this Snowflake account." }, { "output": " 6. Enter Password: (Optional) Specify the password associated with this Snowflake account. This can be left blank if ``snowflake_password`` was specified in the config.toml when starting Driverless AI; otherwise, this field is required." }, { "output": " 7. Enter Role: (Optional) Specify your role as designated within Snowflake. See https://docs.snowflake.net/manuals/user-guide/security-access-control-overview.html for more information. 8. Enter Region: (Optional) Specify the region of the warehouse that you are querying." }, { "output": " This is optional and can also be left blank if ``snowflake_url`` was specified with a ```` in the config.toml when starting Driverless AI. 9. Enter File Formatting Parameters: (Optional) Specify any additional parameters for formatting your datasets." 
}, { "output": " (Note: Use only parameters for ``TYPE = CSV``.) For example, if your dataset includes a text column that contains commas, you can specify a different delimiter using ``FIELD_DELIMITER='character'``. Multiple parameters must be separated with spaces:\n\n ::\n\n FIELD_DELIMITER=',' FIELD_OPTIONALLY_ENCLOSED_BY=\"\" SKIP_BLANK_LINES=TRUE\n\n Note: Be sure that the specified delimiter is not also used as a character within a cell; otherwise an error will occur." }, { "output": " To prevent this from occuring, add ``NULL_IF=()`` to the input of FILE FORMATTING PARAMETERS. 10. Enter Snowflake Query: Specify the Snowflake query that you want to execute. 11. When you are finished, select the Click to Make Query button to add the dataset." }, { "output": " .. _install-on-windows:\n\nWindows 10\n\n\nThis section describes how to install, start, stop, and upgrade Driverless AI on a Windows 10 machine. The installation steps assume that you have a license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " Notes:\n\n- GPU support is not available on Windows. - Scoring is not available on Windows. Caution: Installing Driverless AI on Windows 10 is not recommended for serious use. Environment\n~\n\n+-+-+-+-+\n| Operating System | GPU Support?" }, { "output": " Refer to https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/hyper-v-requirements for more information. Docker Image Installation\n~\n\nNotes: \n\n- Be aware that there are known issues with Docker for Windows." }, { "output": " - Consult with your Windows System Admin if \n\n - Your corporate environment does not allow third-part software installs\n - You are running Windows Defender\n - You your machine is not running with ``Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux``." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same. Requirements\n'\n\n- Windows 10 Pro / Enterprise / Education\n- Docker Desktop for Windows 2.2.0.3 (42716)\n\nNote: As of this writing, Driverless AI has only been tested on Docker Desktop for Windows version 2.2.0.3 (42716)." }, { "output": " Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Download, install, and run Docker for Windows from https://docs.docker.com/docker-for-windows/install/. You can verify that Docker is running by typing ``docker version`` in a terminal (such as Windows PowerShell)." }, { "output": " 3. Before running Driverless AI, you must:\n\n - Enable shared access to the C drive. Driverless AI will not be able to see your local data if this is not set. - Adjust the amount of memory given to Docker to be at least 10 GB." }, { "output": " - Optionally adjust the number of CPUs given to Docker. You can adjust these settings by clicking on the Docker whale in your taskbar (look for hidden tasks, if necessary), then selecting Settings > Shared Drive and Settings > Advanced as shown in the following screenshots." }, { "output": " (Docker will restart.) Note that if you cannot make changes, stop Docker and then start Docker again by right clicking on the Docker icon on your desktop and selecting Run as Administrator. .. 
image:: ../images/windows_docker_menu_bar.png\n :align: center\n :width: 252\n :height: 262\n\n\\\n\n .. image:: ../images/windows_shared_drive_access.png\n :align: center\n :scale: 40%\n\n\\\n\n .. image:: ../images/windows_docker_advanced_preferences.png\n :align: center\n :width: 502\n :height: 326\n\n4." }, { "output": " 5. With Docker running, navigate to the location of your downloaded Driverless AI image. Move the downloaded Driverless AI image to your new directory. 6. Change directories to the new directory, then load the image using the following command:\n\n .. code-block:: bash\n :substitutions:\n \n cd |VERSION-dir|\n docker load -i .\dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " .. code-block:: bash\n\n md data\n md log\n md license\n md tmp\n\n8. Copy data into the /data directory. The data will be visible inside the Docker container at /data. 9. Run ``docker images`` to find the image tag." }, { "output": " 10. Start the Driverless AI Docker image. Be sure to replace ``path_to_`` below with the entire path to the location of the folders that you created (for example, "c:/Users/user-name/driverlessai_folder/data")." }, { "output": " GPU support will not be available. Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker. If both are enabled in the launch command, ``tini`` prints a (harmless) warning message." }, { "output": " If you plan to build the :ref:`image auto model ` extensively, ``--shm-size=2g`` is recommended for the Driverless AI Docker command. .. code-block:: bash\n :substitutions:\n\n docker run --pid=host --rm --shm-size=256m -p 12345:12345 -v c:/path_to_data:/data -v c:/path_to_log:/log -v c:/path_to_license:/license -v c:/path_to_tmp:/tmp h2oai/dai-ubi8-x86_64:|tag|\n\n11." }, { "output": " Add Custom Recipes\n\n\nCustom recipes are Python code snippets that can be uploaded into Driverless AI at runtime like plugins. Restarting Driverless AI is not required. If you do not have a custom recipe, you can select from a number of recipes available in the `Recipes for H2O Driverless AI repository `_." }, { "output": " To add a custom recipe to Driverless AI, click Add Custom Recipe and select one of the following options:\n\n- From computer: Add a custom recipe as a Python or ZIP file from your local file system. - From URL: Add a custom recipe from a URL." }, { "output": " To use this option, your Bitbucket username and password must be provided along with the custom recipe Bitbucket URL. Official Recipes (Open Source)\n\n\nTo access `H2O's official recipes repository `_, click Official Recipes (Open Source)." }, { "output": " If you change the default value of an expert setting from the Expert Settings window, that change is displayed in the TOML configuration editor. For example, if you set the Make MOJO scoring pipeline setting in the Experiment tab to Off, then the line ``make_mojo_scoring_pipeline = "off"`` is displayed in the TOML editor." }, { "output": " To confirm your changes, click Save. The experiment preview updates to reflect your specified configuration changes. For a full list of available settings, see :ref:`expert-settings`. .. note::\n\tDo not edit the section below the ``[recipe_activation]`` line." }, { "output": " .. _h2o_drive:\n\n###############\nH2O Drive setup\n###############\n\nH2O Drive is an object store for `H2O AI Cloud `_. This page describes how to configure Driverless AI to work with H2O Drive." 
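}, { "output": " For orientation, a hypothetical config.toml sketch for enabling the connector is shown below; the endpoint URL and scope list are placeholders, and the exact ``enabled_file_systems`` token for H2O Drive is an assumption here. The relevant attributes are described in the next section:\n\n ::\n\n # Hypothetical sketch; replace the endpoint with your H2O Drive server\n enabled_file_systems = "file, upload, h2o_drive"\n h2o_drive_endpoint_url = "https://drive.example.com"\n h2o_drive_access_token_scopes = "openid profile"\n authentication_method = "oidc"" 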
}, { "output": " Description of relevant configuration attributes\n\n\nThe following are descriptions of the relevant configuration attributes when enabling the H2O AI Feature Store data connector:\n\n- ``enabled_file_systems``: A list of file systems you want to enable." }, { "output": " - ``h2o_drive_endpoint_url``: The H2O Drive server endpoint URL. - ``h2o_drive_access_token_scopes``: A space-separated list of OpenID scopes for the access token that are used by the H2O Drive connector." }, { "output": " - ``authentication_method``: The authentication method used by DAI. When enabling the Feature Store data connector, this must be set to OpenID Connect (``authentication_method=\"oidc\"``). For information on setting up OIDC Authentication in Driverless AI, see :ref:`oidc_auth`." }, { "output": " .. _install-on-macosx:\n\nMac OS X\n\n\nThis section describes how to install, start, stop, and upgrade the Driverless AI Docker image on Mac OS X. Note that this uses regular Docker and not NVIDIA Docker." }, { "output": " The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://h2o.ai/o/try-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " Stick to small datasets! For serious use, please use Linux. - Be aware that there are known performance issues with Docker for Mac. More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#technology." }, { "output": " | Min Mem | Suitable for |\n+=+=+=+=+\n| Mac OS X | No | 16 GB | Experimentation |\n+-+-+-+-+\n\nInstalling Driverless AI\n\n\n1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/." }, { "output": " Download and run Docker for Mac from https://docs.docker.com/docker-for-mac/install. 3. Adjust the amount of memory given to Docker to be at least 10 GB. Driverless AI won't run at all with less than 10 GB of memory." }, { "output": " You will find the controls by clicking on (Docker Whale)->Preferences->Advanced as shown in the following screenshots. (Don't forget to Apply the changes after setting the desired memory value.) .. image:: ../images/macosx_docker_menu_bar.png\n :align: center\n\n.. image:: ../images/macosx_docker_advanced_preferences.png\n :align: center\n :height: 507\n :width: 382\n\n4." }, { "output": " More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#namespaces. .. image:: ../images/macosx_docker_filesharing.png\n :align: center\n :scale: 40%\n\n5. Set up a directory for the version of Driverless AI within the Terminal: \n\n .. code-block:: bash\n :substitutions:\n\n mkdir |VERSION-dir|\n\n6." }, { "output": " 7. Change directories to the new directory, then load the image using the following command:\n\n .. code-block:: bash\n :substitutions:\n\n cd |VERSION-dir|\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n8." }, { "output": " Optionally copy data into the data directory on the host. The data will be visible inside the Docker container at /data. You can also upload data after starting Driverless AI. 10. Run ``docker images`` to find the image tag." }, { "output": " Start the Driverless AI Docker image (still within the new Driverless AI directory). Replace TAG below with the image tag. 
Note that GPU support will not be available. Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker. If both are enabled in the launch command, ``tini`` prints a (harmless) warning message." }, { "output": " If you plan to build the :ref:`image auto model ` extensively, ``--shm-size=2g`` is recommended for the Driverless AI Docker command. .. code-block:: bash\n :substitutions:\n\n docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n12." }, { "output": " Stopping the Docker Image\n~\n\n.. include:: stop-docker.rst\n\nUpgrading the Docker Image\n\n\nThis section provides instructions for upgrading Driverless AI versions that were installed in a Docker container." }, { "output": " WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded when Driverless AI is upgraded. - Build MLI models before upgrading. - Build MOJO pipelines before upgrading." }, { "output": " If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view MLI on that model after upgrading. Before upgrading, be sure to run MLI jobs on models that you want to continue to interpret in future releases." }, { "output": " If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO pipelines on all desired models and then back up your Driverless AI tmp directory." }, { "output": " Upgrade Steps\n'\n\n1. SSH into the IP address of the machine that is running Driverless AI. 2. Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n # cd into the new directory\n cd |VERSION-dir|\n\n3." }, { "output": " 4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " .. _features-settings:\n\nFeatures Settings\n=\n\n``feature_engineering_effort``\n\n\n.. dropdown:: Feature Engineering Effort\n\t:open:\n\n\tSpecify a value from 0 to 10 for the Driverless AI feature engineering effort." }, { "output": " This value defaults to 5.\n\n\t- 0: Keep only numeric features. Only model tuning during evolution.\n\t- 1: Keep only numeric features and frequency-encoded categoricals. Only model tuning during evolution.\n\t- 2: Similar to 1, but with no text features." }, { "output": " \t- 3: Similar to 5, but with tuning only during evolution. Mixed tuning of features and model parameters.\n\t- 4: Similar to 5 but slightly more focused on model tuning.\n\t- 5: Balanced feature-model tuning. (Default)\n\t- 6-7: Similar to 5 but slightly more focused on feature engineering." }, { "output": " \t- 9-10: Similar to 8 but no model tuning during feature evolution. .. _check_distribution_shift:\n\n``check_distribution_shift``\n\n\n.. dropdown:: Data Distribution Shift Detection\n\t:open:\n\n\tSpecify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided)." }, { "output": " Currently, this information is only presented to the user and not acted upon. Shifted features should either be dropped, or more meaningful aggregate features should be created by using them as labels or bins." 
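}, { "output": " As an illustration, shift detection and automatic dropping of shifted features can be configured together in the config.toml file. The keys below are the ones described in this section and the next; the threshold value is an example only, not a recommendation:\n\n ::\n\n # Illustrative sketch of distribution-shift settings\n check_distribution_shift = "on"\n check_distribution_shift_drop = "on"\n drop_features_distribution_shift_threshold_auc = 0.8" 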
}, { "output": " .. _check_distribution_shift_drop:\n\n``check_distribution_shift_drop``\n~\n\n.. dropdown:: Data Distribution Shift Detection Drop of Features\n\t:open:\n\n\tSpecify whether to drop high-shift features. This defaults to Auto." }, { "output": " Also see :ref:`drop_features_distribution_shift_threshold_auc ` and :ref:`check_distribution_shift `. .. _drop_features_distribution_shift_threshold_auc:\n\n``drop_features_distribution_shift_threshold_auc``\n\n\n.. dropdown:: Max Allowed Feature Shift (AUC) Before Dropping Feature\n\t:open:\n\n\tSpecify the maximum allowed AUC value for a feature before dropping the feature." }, { "output": " The shift-detection model includes an AUC value. If this AUC, GINI, or Spearman correlation of the model is above the specified threshold, then Driverless AI considers the shift strong enough to drop those features." }, { "output": " .. _check_leakage:\n\n``check_leakage``\n~\n\n.. dropdown:: Data Leakage Detection\n\t:open:\n\n\tSpecify whether to check for data leakage for each feature. Some features may carry overly high predictive power on the target column." }, { "output": " Driverless AI runs a model to determine the predictive power of each feature on the target variable. Then, a simple model is built on each feature with significant variable importance. The models with high AUC (for classification) or R2 score (regression) are reported to the user as potential leakage." }, { "output": " This is set to Auto by default. The equivalent config.toml parameter is ``check_leakage``. Also see :ref:`drop_features_leakage_threshold_auc `\n\n.. _drop_features_leakage_threshold_auc:\n\n``drop_features_leakage_threshold_auc``\n~\n\n.. dropdown:: Data Leakage Detection Dropping AUC/R2 Threshold\n\t:open:\n\n\tIf :ref:`Leakage Detection ` is enabled, specify the threshold for dropping features." }, { "output": " This value defaults to 0.999. The equivalent config.toml parameter is ``drop_features_leakage_threshold_auc``. ``leakage_max_data_size``\n~\n\n.. dropdown:: Max Rows X Columns for Leakage\n\t:open:\n\n\tSpecify the maximum number of (rows x columns) to trigger sampling for leakage checks." }, { "output": " ``max_features_importance``\n~\n\n.. dropdown:: Max. num. features for variable importance\n\t:open:\n\n\tSpecify the maximum number of features to use and show in importance tables. For any interpretability higher than 1, transformed or original features with lower importance than the top max_features_importance features are always removed, and the feature importances of transformed or original features are pruned correspondingly." }, { "output": " .. _enable_wide_rules:\n\n``enable_wide_rules``\n~\n\n.. dropdown:: Enable Wide Rules\n\t:open:\n\n\tEnable various rules to handle wide datasets (i.e., number of columns > number of rows). The default value is "auto", which automatically enables wide rules when the number of columns is detected to be greater than the number of rows." }, { "output": " Enabling wide data rules sets all ``max_cols``, ``max_orig_*col``, and ``fs_orig*`` tomls to large values, and enforces monotonicity to be disabled unless ``monotonicity_constraints_dict`` is set or the default value of ``monotonicity_constraints_interpretability_switch`` is changed." }, { "output": " It also enables the :ref:`XGBoost Random Forest model ` for modeling. To disable wide rules, set enable_wide_rules to "off". For mostly or entirely numeric datasets, selecting only 'OriginalTransformer' for faster speed is recommended (see :ref:`included_transformers `)." 
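}, { "output": " For example, a config.toml sketch that forces wide-dataset handling regardless of the detected data shape might look like this (illustrative only):\n\n ::\n\n # Force wide-dataset rules instead of the default "auto" detection\n enable_wide_rules = "on"" 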
}, { "output": " ``orig_features_fs_report``\n~\n\n.. dropdown:: Report Permutation Importance on Original Features\n\t:open:\n\n\tSpecify whether Driverless AI reports permutation importance on original features (represented as normalized change in the chosen metric) in logs and the report file." }, { "output": " ``max_rows_fs``\n~\n\n.. dropdown:: Maximum Number of Rows to Perform Permutation-Based Feature Selection\n\t:open:\n\n\tSpecify the maximum number of rows when performing permutation feature importance, reduced by (stratified) random sampling." }, { "output": " ``max_orig_cols_selected``\n\n\n.. dropdown:: Max Number of Original Features Used\n\t:open:\n\n\tSpecify the maximum number of columns to be selected from an existing set of columns using feature selection." }, { "output": " For categorical columns, the selection is based upon how well target encoding (or frequency encoding if not available) on categoricals and on numerics treated as categoricals helps. This is useful for reducing the final model complexity." }, { "output": " ``max_orig_nonnumeric_cols_selected``\n~\n\n.. dropdown:: Max Number of Original Non-Numeric Features\n\t:open:\n\n\tThe maximum number of non-numeric columns selected, above which Driverless AI will do feature selection on all features and avoid treating numerical as categorical. This is the same as ``max_orig_numeric_cols_selected`` above, but for categorical columns." }, { "output": " This value defaults to 300. ``fs_orig_cols_selected``\n~\n\n.. dropdown:: Max Number of Original Features Used for FS Individual\n\t:open:\n\n\tSpecify the maximum number of features you want to be selected in an experiment." }, { "output": " For additional columns above the specified value, a special individual with a reduced set of original columns is added. ``fs_orig_numeric_cols_selected``\n~\n\n.. dropdown:: Number of Original Numeric Features to Trigger Feature Selection Model Type\n\t:open:\n\n\tThe maximum number of original numeric columns, above which Driverless AI will do feature selection." }, { "output": " A separate individual in the :ref:`genetic algorithm ` is created by doing feature selection by permutation importance on original features. This value defaults to 10,000,000. ``fs_orig_nonnumeric_cols_selected``\n\n\n.. dropdown:: Number of Original Non-Numeric Features to Trigger Feature Selection Model Type\n\t:open:\n\n\tThe maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all features." }, { "output": " A separate individual in the :ref:`genetic algorithm ` is created by doing feature selection by permutation importance on original features. This value defaults to 200. ``max_relative_cardinality``\n\n\n.. dropdown:: Max Allowed Fraction of Uniques for Integer and Categorical Columns\n\t:open:\n\n\tSpecify the maximum fraction of unique values for integer and categorical columns." }, { "output": " This value defaults to 0.95. .. _num_as_cat:\n\n``num_as_cat``\n\n\n.. dropdown:: Allow Treating Numerical as Categorical\n\t:open:\n\n\tSpecify whether to allow some numerical features to be treated as categorical features." }, { "output": " The equivalent config.toml parameter is ``num_as_cat``. ``max_int_as_cat_uniques``\n\n\n.. dropdown:: Max Number of Unique Values for Int/Float to be Categoricals\n\t:open:\n\n\tSpecify the maximum number of unique values for integer or real columns to be treated as categoricals." 
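}, { "output": " As a short illustration, the two settings above could be combined in the config.toml file as follows (the values are examples only):\n\n ::\n\n # Treat low-cardinality numeric columns as categoricals (illustrative values)\n num_as_cat = true\n max_int_as_cat_uniques = 50" 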
}, { "output": " ``max_fraction_invalid_numeric``\n\n\n.. dropdown:: Max. fraction of numeric values to be non-numeric (and not missing) for a column to still be considered numeric\n\t:open:\n\n\tWhen the fraction of non-numeric (and non-missing) values is less or equal than this value, consider the column numeric." }, { "output": " Note: Replaces non-numeric values with missing values at start of experiment, so some information is lost, but column is now treated as numeric, which can help. Disabled if < 0. .. _nfeatures_max:\n\n``nfeatures_max``\n~\n\n.. dropdown:: Max Number of Engineered Features\n\t:open:\n\n\tSpecify the maximum number of features to be included per model (and in each model within the final model if an ensemble)." }, { "output": " Final ensemble will exclude any pruned-away features and only train on kept features, but may contain a few new features due to fitting on different data view (e.g. new clusters). Final scoring pipeline will exclude any pruned-away features, but may contain a few new features due to fitting on different data view (e.g." }, { "output": " The default value of -1 means no restrictions are applied for this parameter except internally-determined memory and interpretability restrictions. Notes:\n\n\t * If ``interpretability`` > ``remove_scored_0gain_genes_in_postprocessing_above_interpretability`` (see :ref:`config.toml ` for reference), then every GA (:ref:`genetic algorithm `) iteration post-processes features down to this value just after scoring them." }, { "output": " * If ``ngenes_max`` is also not limited, then some individuals will have more genes and features until pruned by mutation or by preparation for final model. * E.g. to generally limit every iteration to exactly 1 features, one must set ``nfeatures_max`` = ``ngenes_max`` =1 and ``remove_scored_0gain_genes_in_postprocessing_above_interpretability`` = 0, but the genetic algorithm will have a harder time finding good features." }, { "output": " .. _ngenes_max:\n\n``ngenes_max``\n\n\n.. dropdown:: Max Number of Genes\n\t:open:\n\n\tSpecify the maximum number of genes (transformer instances) kept per model (and per each model within the final model for ensembles)." }, { "output": " If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes. Instances includes all possible transformers, including original transformer for numeric features." }, { "output": " The equivalent config.toml parameter is ``ngenes_max``. ``features_allowed_by_interpretability``\n\n\n.. dropdown:: Limit Features by Interpretability\n\t:open:\n\n\tSpecify whether to limit feature counts with the Interpretability training setting as specified by the ``features_allowed_by_interpretability`` :ref:`config.toml ` setting." }, { "output": " This value defaults to 7. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference. .. _monotonicity-constraints-correlation-threshold:\n\n``monotonicity_constraints_correlation_threshold``\n\n\n.. dropdown:: Correlation Beyond Which to Trigger Monotonicity Constraints (if enabled)\n\t:open:\n\n\tSpecify the threshold of Pearson product-moment correlation coefficient between numerical or encoded transformed feature and target above (below negative for) which to use positive (negative) monotonicity for XGBoostGBM, LightGBM and Decision Tree models." 
}, { "output": " Note: This setting is only enabled when Interpretability is greater than or equal to the value specified by the :ref:`enable-constraints` setting and when the :ref:`constraints-override` setting is not specified." }, { "output": " ``monotonicity_constraints_log_level``\n\n\n.. dropdown:: Control amount of logging when calculating automatic monotonicity constraints (if enabled)\n\t:open:\n\n\tFor models that support monotonicity constraints, and if enabled, show automatically determined monotonicity constraints for each feature going into the model based on its correlation with the target." }, { "output": " 'medium' shows correlation of positively and negatively constraint features. 'high' shows all correlation values. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference." }, { "output": " Otherwise all features will be in the model. Only active when interpretability >= monotonicity_constraints_interpretability_switch or monotonicity_constraints_dict is provided. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference." }, { "output": " Original numeric features are mapped to the desired constraint:\n\n\t- 1: Positive constraint\n\t- -1: Negative constraint\n\t- 0: Constraint disabled\n\n\tConstraint is automatically disabled (set to 0) for features that are not in this list." }, { "output": " See :ref:`Monotonicity Constraints in Driverless AI ` for reference. .. _max-feature-interaction-depth:\n\n``max_feature_interaction_depth``\n~\n\n.. dropdown:: Max Feature Interaction Depth\n\t:open:\n\n\tSpecify the maximum number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates." }, { "output": " The interaction can take multiple forms (i.e. feature1 + feature2 or feature1 * feature2 + \u2026 featureN). Although certain machine learning algorithms (like tree-based methods) can do well in capturing these interactions as part of their training process, still generating them may help them (or other algorithms) yield better performance." }, { "output": " Higher values might be able to make more predictive models at the expense of time. This value defaults to 8. Set Max Feature Interaction Depth to 1 to disable any feature interactions ``max_feature_interaction_depth=1``." }, { "output": " To use all features for each transformer, set this to be equal to the number of columns. To do a 50/50 sample and a fixed feature interaction depth of :math:`n` features, set this to -:math:`n`. ``enable_target_encoding``\n\n\n.. dropdown:: Enable Target Encoding\n\t:open:\n\n\tSpecify whether to use Target Encoding when building the model." }, { "output": " A simple example can be to use the mean of the target to replace each unique category of a categorical feature. These type of features can be very predictive but are prone to overfitting and require more memory as they need to store mappings of the unique categories and the target values." }, { "output": " The degree to which GINI is inaccurate is also used to perform fold-averaging of look-up tables instead of using global look-up tables. This is enabled by default. ``enable_lexilabel_encoding``\n~\n\n.. dropdown:: Enable Lexicographical Label Encoding\n\t:open:\n\n\tSpecify whether to enable lexicographical label encoding." }, { "output": " ``enable_isolation_forest``\n~\n\n.. 
dropdown:: Enable Isolation Forest Anomaly Score Encoding\n\t:open:\n\n\t`Isolation Forest `__ is useful for identifying anomalies or outliers in data." }, { "output": " This split depends on how long it takes to separate the points. Random partitioning produces noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies." }, { "output": " This is disabled by default. ``enable_one_hot_encoding``\n~\n\n.. dropdown:: Enable One HotEncoding\n\t:open:\n\n\tSpecify whether one-hot encoding is enabled. The default Auto setting is only applicable for small datasets and GLMs." }, { "output": " This value defaults to 200. ``drop_constant_columns``\n~\n\n.. dropdown:: Drop Constant Columns\n\t:open:\n\n\tSpecify whether to drop columns with constant values. This is enabled by default. ``drop_id_columns``\n~\n\n.. dropdown:: Drop ID Columns\n\t:open:\n\n\tSpecify whether to drop columns that appear to be an ID." }, { "output": " ``no_drop_features``\n\n\n.. dropdown:: Don't Drop Any Columns\n\t:open:\n\n\tSpecify whether to avoid dropping any columns (original or derived). This is disabled by default. .. _features_to_drop:\n\n``cols_to_drop``\n\n\n.. dropdown:: Features to Drop\n\t:open:\n\n\tSpecify which features to drop." }, { "output": " .. _cols_to_force_in:\n\n``cols_to_force_in``\n~\n\n.. dropdown:: Features to always keep or force in, e.g. \"G1\", \"G2\", \"G3\"\n\t:open:\n\n\tControl over columns to force-in. Forced-in features are handled by the most interpretable transformers allowed by the experiment options, and they are never removed (even if the model assigns 0 importance to them)." }, { "output": " When this field is left empty (default), Driverless AI automatically searches all columns (either at random or based on which columns have high variable importance). ``sample_cols_to_group_by``\n~\n\n.. dropdown:: Sample from Features to Group By\n\t:open:\n\n\tSpecify whether to sample from given features to group by or to always group all features." }, { "output": " ``agg_funcs_for_group_by``\n\n\n.. dropdown:: Aggregation Functions (Non-Time-Series) for Group By Operations\n\t:open:\n\n\tSpecify whether to enable aggregation functions to use for group by operations. Choose from the following (all are selected by default):\n\n\t- mean\n\t- sd\n\t- min\n\t- max\n\t- count\n\n``folds_for_group_by``\n\n\n.. dropdown:: Number of Folds to Obtain Aggregation When Grouping\n\t:open:\n\n\tSpecify the number of folds to obtain aggregation when grouping." }, { "output": " The default value is 5. .. _mutation_mode:\n\n``mutation_mode``\n~\n\n.. dropdown:: Type of Mutation Strategy\n\t:open:\n\n\tSpecify which strategy to apply when performing mutations on transformers. Select from the following:\n\n\t- sample: Sample transformer parameters (Default)\n\t- batched: Perform multiple types of the same transformation together\n\t- full: Perform more types of the same transformation together than the above strategy\n\n``dump_varimp_every_scored_indiv``\n\n\n.. dropdown:: Enable Detailed Scored Features Info\n\t:open:\n\n\tSpecify whether to dump every scored individual's variable importance (both derived and original) to a csv/tabulated/json file." }, { "output": " This is disabled by default. ``dump_trans_timings``\n\n\n.. 
dropdown:: Enable Detailed Logs for Timing and Types of Features Produced\n\t:open:\n\n\tSpecify whether to dump every scored fold's timing and feature info to a timings.txt file." }, { "output": " ``compute_correlation``\n~\n\n.. dropdown:: Compute Correlation Matrix\n\t:open:\n\n\tSpecify whether to compute training, validation, and test correlation matrixes. When enabled, this setting creates table and heatmap PDF files that are saved to disk." }, { "output": " This is disabled by default. ``interaction_finder_gini_rel_improvement_threshold``\n~\n\n.. dropdown:: Required GINI Relative Improvement for Interactions\n\t:open:\n\n\tSpecify the required GINI relative improvement value for the InteractionTransformer." }, { "output": " If the data is noisy and there is no clear signal in interactions, this value can be decreased to return interactions. This value defaults to 0.5. ``interaction_finder_return_limit``\n~\n\n.. dropdown:: Number of Transformed Interactions to Make\n\t:open:\n\n\tSpecify the number of transformed interactions to make from generated trial interactions." }, { "output": " This value defaults to 5. .. _enable_rapids_transformers:\n\n``enable_rapids_transformers``\n\n\n.. dropdown:: Whether to enable RAPIDS cuML GPU transformers (no mojo)\n\t:open:\n\n\tSpecify whether to enable GPU-based `RAPIDS cuML `__ transformers." }, { "output": " The equivalent config.toml parameter is ``enable_rapids_transformers`` and the default value is False. .. _lowest_allowed_variable_importance:\n\n``varimp_threshold_at_interpretability_10``\n~\n\n.. dropdown:: Lowest allowed variable importance at interpretability 10\n\t:open:\n\n\tSpecify the variable importance below which features are dropped (with the possibility of a replacement being found that's better)." }, { "output": " Set this to a lower value if you're content with having many weak features despite choosing high interpretability, or if you see a drop in performance due to the need for weak features. ``stabilize_fs``\n\n\n.. dropdown:: Whether to take minimum (True) or mean (False) of delta improvement in score when aggregating feature selection scores across multiple folds/depths\n\t:open:\n\n\tWhether to take minimum (True) or mean (False) of delta improvement in score when aggregating feature selection scores across multiple folds/depths." }, { "output": " Feature selection by permutation importance considers the change in score after shuffling a feature, and using minimum operation ignores optimistic scores in favor of pessimistic scores when aggregating over folds." }, { "output": " If interpretability >= config toml value of fs_data_vary_for_interpretability, then half data (or setting of fs_data_frac) is used as another fit, in which case regardless of this toml setting, only features that are kept for all data sizes are kept by feature selection." }, { "output": " Hive Setup\n\n\nDriverless AI lets you explore Hive data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Hive. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. - ``hive_app_configs``: Configuration for Hive Connector." 
}, { "output": " Important keys include:\n \n - ``hive_conf_path``: The path to Hive configuration. This can have multiple files (e.g. hive-site.xml, hdfs-site.xml, etc.) - ``auth_type``: Specify one of ``noauth``, ``keytab``, or ``keytabimpersonation`` for Kerberos authentication\n - ``keytab_path``: Specify the path to Kerberos keytab to use for authentication (this can be ``\"\"`` if using ``auth_type=\"noauth\"``)\n - ``principal_user``: Specify the Kerberos app principal user (required when using ``auth_type=\"keytab\"`` or ``auth_type=\"keytabimpersonation\"``)\n\nNotes:\n\n- With Hive connectors, it is assumed that DAI is running on the edge node." }, { "output": " missing classes, dependencies, authorization errors). - Ensure the core-site.xml file (from e.g Hadoop conf) is also present in the Hive conf with the rest of the files (hive-site.xml, hdfs-site.xml, etc.)." }, { "output": " ``hadoop.proxyuser.hive.hosts`` & ``hadoop.proxyuser.hive.groups``). - If you have tez as the Hive execution engine, make sure that the required tez dependencies (classpaths, jars, etc.) are available on the DAI node." }, { "output": " The configuration should be JSON/Dictionary String with multiple keys. For example:\n \n ::\n\n \"\"\"{\n \"hive_connection_1\": {\n \"hive_conf_path\": \"/path/to/hive/conf\",\n \"auth_type\": \"one of ['noauth', 'keytab',\n 'keytabimpersonation']\",\n \"keytab_path\": \"/path/to/.keytab\",\n \"principal_user\": \"hive/node1.example.com@EXAMPLE.COM\",\n },\n \"hive_connection_2\": {\n \"hive_conf_path\": \"/path/to/hive/conf_2\",\n \"auth_type\": \"one of ['noauth', 'keytab', \n 'keytabimpersonation']\",\n \"keytab_path\": \"/path/to/.keytab\",\n \"principal_user\": \"hive/node2.example.com@EXAMPLE.COM\",\n }\n }\"\"\"\n\n \\ Note: The expected input of ``hive_app_configs`` is a `JSON string `__." }, { "output": " Depending on how the configuration value is applied, different forms of outer quotations may be required. The following examples show two unique methods for applying outer quotations. - Configuration value applied with the config.toml file:\n\n ::\n\n hive_app_configs = \"\"\"{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}\"\"\"\n\n - Configuration value applied with an environment variable:\n\n ::\n\n DRIVERLESS_AI_HIVE_APP_CONFIGS='{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}'\n\n- ``hive_app_jvm_args``: Optionally specify additional Java Virtual Machine (JVM) args for the Hive connector." }, { "output": " Notes:\n\n - If a custom `JAAS configuration file `__ is needed for your Kerberos setup, use ``hive_app_jvm_args`` to specify the appropriate file:\n\n ::\n\n hive_app_jvm_args = \"-Xmx20g -Djava.security.auth.login.config=/etc/dai/jaas.conf\"\n\n Sample ``jaas.conf`` file:\n ::\n\n com.sun.security.jgss.initiate {\n com.sun.security.auth.module.Krb5LoginModule required\n useKeyTab=true\n useTicketCache=false\n principal=\"hive/localhost@EXAMPLE.COM\" [Replace this line]\n doNotPrompt=true\n keyTab=\"/path/to/hive.keytab\" [Replace this line]\n debug=true;\n };\n\n- ``hive_app_classpath``: Optionally specify an alternative classpath for the Hive connector." }, { "output": " This can be done by specifying each environment variable in the ``nvidia-docker run`` command or by editing the configuration options in the config.toml file and then specifying that file in the ``nvidia-docker run`` command." }, { "output": " Start the Driverless AI Docker Image. .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,hdfs,hive" \\\n -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{"hive_connection_2": {"hive_conf_path":"/etc/hadoop/conf",\n "auth_type":"keytabimpersonation",\n "keytab_path":"/etc/dai/steam.keytab",\n "principal_user":"steam/mr-0xg9.0xdata.loc@H2OAI.LOC"}}' \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -v /path/to/hive/conf:/path/to/hive/conf/in/docker \\\n -v /path/to/hive.keytab:/path/in/docker/hive.keytab \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure Hive options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Enable and configure the Hive connector in the Driverless AI config.toml file. The Hive connector configuration must be a JSON/Dictionary string with multiple keys. .. code-block:: bash \n\n enabled_file_systems = "file, hdfs, s3, hive"\n hive_app_configs = """{"hive_1": {"auth_type": "keytab",\n "key_tab_path": "/path/to/hive.keytab",\n "hive_conf_path": "/path/to/hive-resources",\n "principal_user": "hive/localhost@EXAMPLE.COM"}}"""\n\n 2." }, { "output": " .. code-block:: bash \n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -v /path/to/hive/conf:/path/to/hive/conf/in/docker \\\n -v /path/to/hive.keytab:/path/in/docker/hive.keytab \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Native Installs\n\n This enables the Hive connector." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"\n\n 2." 
}, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs, s3, hive\"\n\n \n # Configuration for Hive Connector\n # Note that inputs are similar to configuring HDFS connectivity\n # Important keys:\n # * hive_conf_path - path to hive configuration, may have multiple files." }, { "output": " Required when using auth_type `keytab` or `keytabimpersonation`\n # JSON/Dictionary String with multiple keys. Example:\n # \"\"\"{\n # \"hive_connection_1\": {\n # \"hive_conf_path\": \"/path/to/hive/conf\",\n # \"auth_type\": \"one of ['noauth', 'keytab', 'keytabimpersonation']\",\n # \"keytab_path\": \"/path/to/.keytab\",\n # principal_user\": \"hive/localhost@EXAMPLE.COM\",\n # }\n # }\"\"\"\n #\n hive_app_configs = \"\"\"{\"hive_1\": {\"auth_type\": \"keytab\",\n \"key_tab_path\": \"/path/to/hive.keytab\",\n \"hive_conf_path\": \"/path/to/hive-resources\",\n \"principal_user\": \"hive/localhost@EXAMPLE.COM\"}}\"\"\"\n\n 3." }, { "output": " Adding Datasets Using Hive\n~\n\nAfter the Hive connector is enabled, you can add datasets by selecting Hive from the Add Dataset (or Drag and Drop) drop-down menu. 1. Select the Hive configuraton that you want to use." }, { "output": " Specify the following information to add your dataset. - Hive Database: Specify the name of the Hive database that you are querying. - Hadoop Configuration Path: Specify the path to your Hive configuration file." }, { "output": " - Hive Kerberos Principal: Specify the Hive Kerberos principal. This is required if the Hive Authentication Type is keytabimpersonation. - Hive Authentication Type: Specify the authentication type. This can be noauth, keytab, or keytabimpersonation." }, { "output": " Install on Ubuntu\n-\n\nThis section describes how to install the Driverless AI Docker image on Ubuntu. The installation steps vary depending on whether your system has GPUs or if it is CPU only. Environment\n~\n\n+-+-+-+\n| Operating System | GPUs?" }, { "output": " Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following steps. 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. (Note that the contents of this Docker image include a CentOS kernel and CentOS packages.)" }, { "output": " Install and run Docker on Ubuntu (if not already installed):\n\n .. 
code-block:: bash\n\n # Install and run Docker on Ubuntu\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n sudo apt-key fingerprint 0EBFCD88\n sudo add-apt-repository \\\n "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"\n sudo apt-get update\n sudo apt-get install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md. .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(." }, { "output": " Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver: \n\n .. code-block:: bash\n\n nvidia-smi\n\n5." }, { "output": " Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\n8. Set up the data, log, and license directories on the host machine:\n\n .. code-block:: bash\n\n # Set up the data, log, license, and tmp directories on the host machine (within the new directory)\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n9." }, { "output": " The data will be visible inside the Docker container. 10. Run ``docker images`` to find the image tag. 11. Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " We recommend ``--shm-size=256m`` in the docker launch command. If you plan to build the :ref:`image auto model ` extensively, ``--shm-size=2g`` is recommended for the Driverless AI Docker command." }, { "output": " .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run --runtime=nvidia \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n12." }, { "output": " This section describes how to install and start the Driverless AI Docker image on Ubuntu. Note that this uses ``docker`` and not ``nvidia-docker``. GPU support will not be available. 
Watch the installation video `here `__." }, { "output": " Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following steps. 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2." }, { "output": " Set up a directory for the version of Driverless AI on the host machine: \n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n4. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " At this point, you can copy data into the data directory on the host machine. The data will be visible inside the Docker container. 7. Run ``docker images`` to find the new image tag. 8. Start the Driverless AI Docker image." }, { "output": " Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker. If both are enabled in the launch command, tini prints a (harmless) warning message." }, { "output": " However, if you plan to build :ref:`image auto model ` extensively, ``--shm-size=2g`` is recommended for the Driverless AI Docker command. .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n9." }, { "output": " .. _linux-tarsh:\n\nLinux TAR SH\n\n\nThe Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive. This form of installation does not require a privileged user to install or to run." }, { "output": " See those sections for a full list of supported environments. The installation steps assume that you have a valid license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/." }, { "output": " .. note::\n\tTo ensure that :ref:`AutoDoc ` pipeline visualizations are generated correctly on native installations, installing `fontconfig `_ is recommended." }, { "output": " Note that if you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02.\n- OpenCL (Required for full LightGBM support on GPU-powered systems)\n- Driverless AI TAR SH, available from https://www.h2o.ai/download/\n\nNote: CUDA 11.2.2 (for GPUs) and cuDNN (required for TensorFlow support on GPUs) are included in the Driverless AI package." }, { "output": " To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. 
note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used." }, { "output": " Installing Driverless AI\n\n\nRun the following commands to install the Driverless AI TAR SH. .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI.\n chmod 755 |VERSION-tar-lin|\n ./|VERSION-tar-lin|\n\nYou may now cd to the unpacked directory and optionally make changes to config.toml." }, { "output": " ./run-dai.sh\n\nStarting NVIDIA Persistence Mode\n\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Run the following for CentOS 7/RHEL 7 based systems using yum and x86. .. code-block:: bash\n\n yum -y clean all\n yum -y makecache\n yum -y update\n wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm\n wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm\n rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm\n rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm\n clinfo\n\n mkdir -p /etc/OpenCL/vendors && \\\n echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd\n\nLooking at Driverless AI log files\n\n\n.. code-block:: bash\n\n less log/dai.log\n less log/h2o.log\n less log/procsy.log\n less log/vis-server.log\n\nStopping Driverless AI\n\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " By default, all files for Driverless AI are contained within this directory. Upgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend having NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers. For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__." }, { "output": " Upgrade Steps\n'\n\n1. Stop your previous version of Driverless AI. 2. Run the self-extracting archive for the new version of Driverless AI. 3. Port any previous changes you made to your config.toml file to the newly unpacked directory." }, { "output": " Experiment Settings\n=\n\nThis section includes settings that can be used to customize the experiment, such as total runtime, reproducibility level, pipeline building, feature brain control, adding config.toml settings, and more." }, { "output": " This is equivalent to pushing the Finish button once half of the specified time value has elapsed. Note that the overall enforced runtime is only an approximation. This value defaults to 1440, which is the equivalent of a 24 hour approximate overall runtime." }, { "output": " Set this value to 0 to disable this setting. Note that this setting applies per experiment, so when building n leaderboard models it will apply to each experiment separately (i.e., the total allowed runtime will be n*24hrs)." }, { "output": " This option preserves experiment artifacts that have been generated for the summary and log zip files while continuing to generate additional artifacts. This value defaults to 10080 mins (7 days). Note that this setting applies per experiment, so when building leaderboard models (say n), it will apply to each experiment separately (i.e., the total allowed runtime will be n*7days)." }, { "output": " Also see :ref:`time_abort `. .. _time_abort:\n\n``time_abort``\n\n\n.. 
dropdown:: Time to Trigger the 'Abort' Button\n\t:open:\n\n\tIf the experiment is not done by this time, push the abort button." }, { "output": " Also see :ref:`max_runtime_minutes_until_abort ` for control over per experiment abort times. This accepts time in the format given by time_abort_format (defaults to %Y-%m-%d %H:%M:%S). This assumes a timezone set by time_abort_timezone in config.toml (defaults to UTC)." }, { "output": " This will apply to the time on a DAI worker that runs the experiments. Similar to :ref:`max_runtime_minutes_until_abort `, time abort preserves experiment artifacts made so far for summary and log zip files." }, { "output": " .. _pipeline-building-recipe:\n\n``pipeline-building-recipe``\n\n\n.. dropdown:: Pipeline Building Recipe\n\t:open:\n\n\tSpecify the Pipeline Building recipe type (overrides GUI settings). Select from the following:\n\n\t- Auto: Specifies that all models and features are automatically determined by experiment settings, config.toml settings, and the feature engineering effort." }, { "output": " - Only uses GLM or booster as 'gblinear'. - :ref:`Fixed ensemble level ` is set to 0. - :ref:`Feature brain level ` is set to 0. - Max feature interaction depth is set to 1, i.e., no interactions." }, { "output": " - Does not use :ref:`distribution shift ` detection. - :ref:`monotonicity_constraints_correlation_threshold ` is set to 0." }, { "output": " - Drops features that are not correlated with target by at least 0.01. See :ref:`monotonicity-constraints-drop-low-correlation-features ` and :ref:`monotonicity-constraints-correlation-threshold `." }, { "output": " - :ref:`Interaction depth ` is set to 1, i.e., no multi-feature interactions are done, to avoid complexity. - No target transformations are applied for regression problems, i.e., sets :ref:`target_transformer ` to 'identity'." }, { "output": " - :ref:`num_as_cat ` feature transformation is disabled. - List of included_transformers\n\t\t\n \t| 'OriginalTransformer', #numeric (no clustering, no interactions, no num->cat)\n \t| 'CatOriginalTransformer', 'RawTransformer','CVTargetEncodeTransformer', 'FrequentTransformer','WeightOfEvidenceTransformer','OneHotEncodingTransformer', #categorical (but no num-cat)\n \t| 'CatTransformer','StringConcatTransformer', # big data only\n \t| 'DateOriginalTransformer', 'DateTimeOriginalTransformer', 'DatesTransformer', 'DateTimeDiffTransformer', 'IsHolidayTransformer', 'LagsTransformer', 'EwmaLagsTransformer', 'LagsInteractionTransformer', 'LagsAggregatesTransformer',#dates/time\n \t| 'TextOriginalTransformer', 'TextTransformer', 'StrFeatureTransformer', 'TextCNNTransformer', 'TextBiGRUTransformer', 'TextCharCNNTransformer', 'BERTTransformer',#text\n \t| 'ImageOriginalTransformer', 'ImageVectorizerTransformer'] #image\n\n \tFor reference also see :ref:`Monotonicity Constraints in Driverless AI `." }, { "output": " - The test set is concatenated with the train set, with the target marked as missing\n\t\t- Transformers that do not use the target are allowed to ``fit_transform`` across the entirety of the train, validation, and test sets." }, { "output": " - nlp_model: Only enable NLP BERT models based on PyTorch to process pure text. To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended. For more information, see :ref:`nlp-in-dai`." }, { "output": " To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended. For more information, see :ref:`nlp-in-dai`. 
- included_transformers = ['BERTTransformer']\n\t\t- excluded_models = ['TextBERTModel', 'TextMultilingualBERTModel', 'TextXLNETModel', 'TextXLMModel','TextRoBERTaModel', 'TextDistilBERTModel', 'TextALBERTModel', 'TextCamemBERTModel', 'TextXLMRobertaModel']\n\t\t- enable_pytorch_nlp_transformer = 'on'\n\t\t- enable_pytorch_nlp_model = 'off'\n\n\t- image_model: Only enable image models that process pure images (ImageAutoModel)." }, { "output": " For more information, see :ref:`image-model`. Notes:\n\n \t\t- This option disables the :ref:`Genetic Algorithm ` (GA). - Image insights are only available when this option is selected. - image_transformer: Only enable the ImageVectorizer transformer, which processes pure images." }, { "output": " - unsupervised: Only enable unsupervised transformers, models and scorers. :ref:`See ` for reference. - gpus_max: Maximize use of GPUs (e.g. use XGBoost, RAPIDS, Optuna hyperparameter search, etc." }, { "output": " Each pipeline building recipe mode can be chosen, and then fine-tuned using individual expert settings. Changing the pipeline building recipe will reset all pipeline building recipe options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline building recipe rules." }, { "output": " To reset recipe behavior, one can switch between 'auto' and the desired mode. This way the new child experiment will use the default settings for the chosen recipe. .. _enable_genetic_algorithm:\n\n``enable_genetic_algorithm``\n\n\n.. dropdown:: Enable Genetic Algorithm for Selection and Tuning of Features and Models\n\t:open:\n\n\tSpecify whether to enable :ref:`genetic algorithm ` for selection and hyper-parameter tuning of features and models:\n\n\t- auto: Default value is 'auto'." }, { "output": " - on: Driverless AI genetic algorithm is used for feature engineering and model tuning and selection. - Optuna: When 'Optuna' is selected, model hyperparameters are tuned with :ref:`Optuna ` and the Driverless AI genetic algorithm is used for feature engineering." }, { "output": " Optuna mode currently only uses Optuna for XGBoost, LightGBM, and CatBoost (custom recipe). If the Pruner is enabled, as is the default, Optuna mode disables mutations of the evaluation metric (eval_metric) so that pruning uses the same metric across trials for comparison." }, { "output": " The equivalent config.toml parameter is ``enable_genetic_algorithm``. .. _tournament_style:\n\n``tournament_style``\n\n\n.. dropdown:: Tournament Model for Genetic Algorithm\n\t:open:\n\n\tSelect a method to decide which models are best at each iteration." }, { "output": " Choose from the following:\n\n\t- auto: Choose based upon accuracy and interpretability\n\t- uniform: all individuals in population compete to win as best (can lead to all, e.g. LightGBM models in final ensemble, which may not improve ensemble performance due to lack of diversity)\n\t- fullstack: Choose from optimal model and feature types\n\t- feature: individuals with similar feature types compete (good if target encoding, frequency encoding, and other feature sets lead to good results)\n\t- model: individuals with same model type compete (good if multiple models do well but some models that do not do as well still contribute to improving ensemble)\n\n\tIn each case, a round-robin approach is used to choose the best scores among the types of models available." 
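}, { "output": " For instance, the tournament can be pinned for all experiments via config.toml (a minimal sketch using one of the values listed above):\n\n\t::\n\n\t tournament_style = \"model\""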
}, { "output": " The tournament is only used to prune-down individuals for, e.g., tuning -> evolution and evolution -> final model. ``make_python_scoring_pipeline``\n\n\n.. dropdown:: Make Python Scoring Pipeline\n\t:open:\n\n\tSpecify whether to automatically build a Python Scoring Pipeline for the experiment." }, { "output": " Select Off to disable the automatic creation of the Python Scoring Pipeline. ``make_mojo_scoring_pipeline``\n\n\n.. dropdown:: Make MOJO Scoring Pipeline\n\t:open:\n\n\tSpecify whether to automatically build a MOJO (Java) Scoring Pipeline for the experiment." }, { "output": " With this option, any capabilities that prevent the creation of the pipeline are dropped. Select Off to disable the automatic creation of the MOJO Scoring Pipeline. Select Auto (default) to attempt to create the MOJO Scoring Pipeline without dropping any capabilities." }, { "output": " When this is set to Auto (default), the MOJO is only used if the number of rows is equal to or below the value specified by ``mojo_for_predictions_max_rows``. .. _reduce_mojo_size:\n\n``reduce_mojo_size``\n~\n.. dropdown:: Attempt to Reduce the Size of the MOJO (Small MOJO)\n\t:open:\n\n\tSpecify whether to attempt to create a small MOJO scoring pipeline when the experiment is being built." }, { "output": " This setting attempts to reduce the mojo size by limiting experiment's maximum :ref:`interaction depth ` to 3, setting :ref:`ensemble level ` to 0 i.e no ensemble model for final pipeline and limiting the :ref:`maximum number of features ` in the model to 200." }, { "output": " This is disabled by default. The equivalent config.toml setting is ``reduce_mojo_size``\n\n``make_pipeline_visualization``\n\n\n.. dropdown:: Make Pipeline Visualization\n\t:open:\n\n\tSpecify whether to create a visualization of the scoring pipeline at the end of an experiment." }, { "output": " Note that the Visualize Scoring Pipeline feature is experimental and is not available for deprecated models. Visualizations are available for all newly created experiments. ``benchmark_mojo_latency``\n\n\n.. dropdown:: Measure MOJO Scoring Latency\n\t:open:\n\n\tSpecify whether to measure the MOJO scoring latency at the time of MOJO creation." }, { "output": " In this case, MOJO scoring latency will be measured if the pipeline.mojo file size is less than 100 MB. ``mojo_building_timeout``\n~\n\n.. dropdown:: Timeout in Seconds to Wait for MOJO Creation at End of Experiment\n\t:open:\n\n\tSpecify the amount of time in seconds to wait for MOJO creation at the end of an experiment." }, { "output": " This value defaults to 1800 sec (30 minutes). ``mojo_building_parallelism``\n~\n\n.. dropdown:: Number of Parallel Workers to Use During MOJO Creation\n\t:open:\n\n\tSpecify the number of parallel workers to use during MOJO creation." }, { "output": " Set this value to -1 (default) to use all physical cores. ``kaggle_username``\n~\n\n.. dropdown:: Kaggle Username\n\t:open:\n\n\tOptionally specify your Kaggle username to enable automatic submission and scoring of test set predictions." }, { "output": " If you don't have a Kaggle account, you can sign up at https://www.kaggle.com. ``kaggle_key``\n\n\n.. dropdown:: Kaggle Key\n\t:open:\n\n\tSpecify your Kaggle API key to enable automatic submission and scoring of test set predictions." }, { "output": " For more information on obtaining Kaggle API credentials, see https://github.com/Kaggle/kaggle-api#api-credentials. ``kaggle_timeout``\n\n\n.. 
dropdown:: Kaggle Submission Timeout in Seconds\n\t:open:\n\n\tSpecify the Kaggle submission timeout in seconds." }, { "output": " ``min_num_rows``\n\n\n.. dropdown:: Min Number of Rows Needed to Run an Experiment\n\t:open:\n\n\tSpecify the minimum number of rows that a dataset must contain in order to run an experiment. This value defaults to 100." }, { "output": " Note that this setting is only used when the :ref:`reproducible` option is enabled in the experiment:\n\n\t- 1 = Same experiment results for same O/S, same CPU(s), and same GPU(s) (Default)\n\t- 2 = Same experiment results for same O/S, same CPU architecture, and same GPU architecture\n\t- 3 = Same experiment results for same O/S, same CPU architecture (excludes GPUs)\n\t- 4 = Same experiment results for same O/S (best approximation)\n\n\tThis value defaults to 1." }, { "output": " When a seed is defined and the reproducible button is enabled (not by default), the algorithm will behave deterministically. ``allow_different_classes_across_fold_splits``\n\n\n.. dropdown:: Allow Different Sets of Classes Across All Train/Validation Fold Splits\n\t:open:\n\n\t(Note: Applicable for multiclass problems only.)" }, { "output": " This is enabled by default. ``save_validation_splits``\n\n\n.. dropdown:: Store Internal Validation Split Row Indices\n\t:open:\n\n\tSpecify whether to store internal validation split row indices. This includes pickles of (train_idx, valid_idx) tuples (numpy row indices for original training data) for all internal validation folds in the experiment summary ZIP file." }, { "output": " This setting is disabled by default. ``max_num_classes``\n~\n\n.. dropdown:: Max Number of Classes for Classification Problems\n\t:open:\n\n\tSpecify the maximum number of classes to allow for a classification problem." }, { "output": " Memory requirements also increase with a higher number of classes. This value defaults to 200. ``max_num_classes_compute_roc``\n~\n\n.. dropdown:: Max Number of Classes to Compute ROC and Confusion Matrix for Classification Problems\n\n\tSpecify the maximum number of classes to use when computing the ROC and CM." }, { "output": " This value defaults to 200 and cannot be lower than 2. ``max_num_classes_client_and_gui``\n\n\n.. dropdown:: Max Number of Classes to Show in GUI for Confusion Matrix\n\t:open:\n\n\tSpecify the maximum number of classes to show in the GUI for CM, showing first ``max_num_classes_client_and_gui`` labels." }, { "output": " Note that if this value is changed in the config.toml and the server is restarted, then this setting will only modify client-GUI launched diagnostics. To control experiment plots, this value must be changed in the expert settings panel." }, { "output": " Note that this doesn't limit final model calculation. ``use_feature_brain_new_experiments``\n~\n\n.. dropdown:: Whether to Use Feature Brain for New Experiments\n\t:open:\n\n\tSpecify whether to use feature_brain results even if running new experiments." }, { "output": " Even rescoring may be insufficient, so by default this is False. For example, one experiment may have training=external validation by accident, and get high score, and while feature_brain_reset_score='on' means we will rescore, it will have already seen during training the external validation and leak that data as part of what it learned from." }, { "output": " .. _feature_brain1:\n\n``feature_brain_level``\n~\n\n.. 
dropdown:: Model/Feature Brain Level\n\t:open:\n\n\tSpecify whether to use H2O.ai brain, which enables local caching and smart re-use (checkpointing) of prior experiments to generate useful features and models for new experiments." }, { "output": " When enabled, this will use the H2O.ai brain cache if the cache file:\n\n\t - has any matching column names and types for a similar experiment type\n\t - has classes that match exactly\n\t - has class labels that match exactly\n\t - has basic time series choices that match\n\t - the interpretability of the cache is equal or lower\n\t - the main model (booster) is allowed by the new experiment\n\n\t- -1: Don't use any brain cache (default)\n\t- 0: Don't use any brain cache but still write to cache." }, { "output": " - 1: Smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model. The match may not be precise, so use with caution. - 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time series options identically." }, { "output": " - 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete first iteration." }, { "output": " Tune only if the brain population is of insufficient size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete first iteration. - 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations to get the best scored individuals." }, { "output": " When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain. In addition, the default maximum brain size is 20GB. Both the directory and the maximum size can be changed in the config.toml file." }, { "output": " .. _feature_brain2:\n\n``feature_brain2``\n\n\n.. dropdown:: Feature Brain Save Every Which Iteration\n\t:open:\n\n\tSave feature brain iterations every iter_num % feature_brain_iterations_save_every_iteration 0, to be able to restart/refit with which_iteration_brain >= 0." }, { "output": " - -1: Don't use any brain cache. - 0: Don't use any brain cache but still write to cache. - 1: Smart checkpoint if an old experiment_id is passed in (for example, via running \"resume one like this\" in the GUI)." }, { "output": " (default)\n\t- 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size. - 4: Smart checkpoint like level #2 but for the entire population." }, { "output": " - 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations (starting from resumed experiment if chosen) in order to get the best scored individuals. When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain." }, { "output": " Both the directory and the maximum size can be changed in the config.toml file. .. _feature_brain3:\n\n``feature_brain3``\n\n.. dropdown:: Feature Brain Restart from Which Iteration\n\t:open:\n\n\tWhen performing restart or re-fit of type feature_brain_level with a resumed ID, specify which iteration to start from instead of only last best." }, { "output": " Note: If restarting from a tuning iteration, this will pull in the entire scored tuning population and use that for feature evolution. This value defaults to -1. .. 
_feature_brain4:\n\n``feature_brain4``\n\n\n.. dropdown:: Feature Brain Refit Uses Same Best Individual\n\t:open:\n\n\tSpecify whether to use the same best individual when performing a refit." }, { "output": " Enabling this setting lets you view the exact same model or feature with only one new feature added. This is disabled by default. .. _feature_brain5:\n\n``feature_brain5``\n\n\n.. dropdown:: Feature Brain Adds Features with New Columns Even During Retraining of Final Model\n\t:open:\n\n\tSpecify whether to add additional features from new columns to the pipeline, even when performing a retrain of the final model." }, { "output": " New data may lead to new dropped features due to shift or leak detection. Disable this to avoid adding any columns as new features so that the pipeline is perfectly preserved when changing data. This is enabled by default." }, { "output": " If this is disabled, the original hyperparameters will be used instead. (Note that this may result in errors.) This is enabled by default. ``min_dai_iterations``\n\n\n.. dropdown:: Min DAI Iterations\n\t:open:\n\n\tSpecify the minimum number of Driverless AI iterations for an experiment." }, { "output": " This value defaults to 0. .. _target_transformer:\n\n``target_transformer``\n\n\n.. dropdown:: Select Target Transformation of the Target for Regression Problems\n\t:open:\n\n\tSpecify whether to automatically select target transformation for regression problems." }, { "output": " Selecting identity_noclip automatically turns off any target transformations. All transformers except for center, standardize, identity_noclip and log_noclip perform clipping to constrain the predictions to the domain of the target in the training data, so avoid them if you want to enable extrapolations." }, { "output": " ``fixed_num_folds_evolution``\n~\n\n.. dropdown:: Number of Cross-Validation Folds for Feature Evolution\n\t:open:\n\n\tSpecify the fixed number of cross-validation folds (if >= 2) for feature evolution. Note that the actual number of allowed folds can be less than the specified value, and that the number of allowed folds is determined at the time an experiment is run." }, { "output": " ``fixed_num_folds``\n~\n\n.. dropdown:: Number of Cross-Validation Folds for Final Model\n\t:open:\n\n\tSpecify the fixed number of cross-validation folds (if >= 2) for the final model. Note that the actual number of allowed folds can be less than the specified value, and that the number of allowed folds is determined at the time an experiment is run." }, { "output": " ``fixed_only_first_fold_model``\n~\n\n.. dropdown:: Force Only First Fold for Models\n\t:open:\n\n\tSpecify whether to force only the first fold for models. Select from Auto (Default), On, or Off. Set \"on\" to force only first fold for models.This is useful for quick runs regardless of data\n\n``feature_evolution_data_size``\n~\n\n.. dropdown:: Max Number of Rows Times Number of Columns for Feature Evolution Data Splits\n\t:open:\n\n\tSpecify the maximum number of rows allowed for feature evolution data splits (not for the final pipeline)." }, { "output": " ``final_pipeline_data_size``\n\n\n.. dropdown:: Max Number of Rows Times Number of Columns for Reducing Training Dataset\n\t:open:\n\n\tSpecify the upper limit on the number of rows times the number of columns for training the final pipeline." }, { "output": " ``max_validation_to_training_size_ratio_for_final_ensemble``\n\n\n.. 
dropdown:: Maximum Size of Validation Data Relative to Training Data\n\t:open:\n\n\tSpecify the maximum size of the validation data relative to the training data." }, { "output": " Note that final model predictions and scores will always be provided on the full dataset provided. This value defaults to 2.0. ``force_stratified_splits_for_imbalanced_threshold_binary``\n~\n\n.. dropdown:: Perform Stratified Sampling for Binary Classification If the Target Is More Imbalanced Than This\n\t:open:\n\n\tFor binary classification experiments, specify a threshold ratio of minority to majority class for the target column beyond which stratified sampling is performed." }, { "output": " This value defaults to 0.01. You can choose to always perform random sampling by setting this value to 0, or to always perform stratified sampling by setting this value to 1. .. _config_overrides:\n\n``config_overrides``\n\n\n.. dropdown:: Add to config.toml via TOML String\n\t:open:\n\n\tSpecify any additional configuration overrides from the config.toml file that you want to include in the experiment." }, { "output": " Setting this will override all other settings. Separate multiple config overrides with ``\\n``. For example, the following enables Poisson distribution for LightGBM and disables Target Transformer Tuning." }, { "output": " ::\n\n\t params_lightgbm=\\\"{'objective':'poisson'}\\\" \\n target_transformer=identity\n\n\tOr you can specify config overrides similar to the following without having to escape double quotes:\n\n\t::\n\n\t \"\"enable_glm=\"off\" \\n enable_xgboost_gbm=\"off\" \\n enable_lightgbm=\"off\" \\n enable_tensorflow=\"on\"\"\"\n\t \"\"max_cores=10 \\n data_precision=\"float32\" \\n max_rows_feature_evolution=50000000000 \\n ensemble_accuracy_switch=11 \\n feature_engineering_effort=1 \\n target_transformer=\"identity\" \\n tournament_feature_style_accuracy_switch=5 \\n params_tensorflow=\"{'layers': [100, 100, 100, 100, 100, 100]}\"\"\"\n\n\tWhen running the Python client, config overrides would be set as follows:\n\n\t::\n\n\t\tmodel = h2o.start_experiment_sync(\n\t\t dataset_key=train.key,\n\t\t target_col='target',\n\t\t is_classification=True,\n\t\t accuracy=7,\n\t\t time=5,\n\t\t interpretability=1,\n\t\t config_overrides=\"\"\"\n\t\t feature_brain_level=0\n\t\t enable_lightgbm=\"off\"\n\t\t enable_xgboost_gbm=\"off\"\n\t\t enable_ftrl=\"off\"\n\t\t \"\"\"\n\t\t)\n\n``last_recipe``\n~\n\n.. dropdown:: last_recipe\n\t:open:\n\n\tInternal helper to allow memory of if changed recipe\n\n``feature_brain_reset_score``\n~\n\n.. dropdown:: Whether to re-score models from brain cache\n\t:open:\n\n\tSpecify whether to smartly keep score to avoid re-munging/re-training/re-scoring steps brain models ('auto'), always force all steps for all brain imports ('on'), or never rescore ('off')." }, { "output": " 'on' is useful when smart similarity checking is not reliable enough. 'off' is useful when know want to keep exact same features and model for final model refit, despite changes in seed or other behaviors in features that might change the outcome if re-scored before reaching final model." }, { "output": " Can also set refit_same_best_individual True if want exact same best individual (highest scored model+features) to be used regardless of any scoring changes. ``feature_brain_save_every_iteration``\n\n\n.. 
dropdown:: Feature Brain Save every which iteration\n\t:open:\n\n\tSpecify whether to save feature brain iterations every iter_num % feature_brain_iterations_save_every_iteration 0, to be able to restart/refit with which_iteration_brain >= 0." }, { "output": " ``which_iteration_brain``\n~\n\n.. dropdown:: Feature Brain Restart from which iteration\n\t:open:\n\n\tWhen performing restart or re-fit type feature_brain_level with resumed_experiment_id, choose which iteration to start from, instead of only last best -1 means just use last best." }, { "output": " ``refit_same_best_individual``\n\n\n.. dropdown:: Feature Brain refit uses same best individual\n\t:open:\n\n\tWhen doing re-fit from feature brain, if change columns or features, population of individuals used to refit from may change order of which was best, leading to better result chosen (False case)." }, { "output": " That is, if refit with just 1 extra column and have interpretability=1, then final model will be same features, with one more engineered feature applied to that new original feature. ``restart_refit_redo_origfs_shift_leak``\n\n\n.. dropdown:: For restart-refit, select which steps to do\n\t:open:\n\n\tWhen doing restart or re-fit of experiment from feature brain, sometimes user might change data significantly and then warrant redoing reduction of original features by feature selection, shift detection, and leakage detection." }, { "output": " due to random seed if not setting reproducible mode), leading to changes in features and model that is refitted. By default, restart and refit avoid these steps assuming data and experiment setup have no changed significantly." }, { "output": " In order to ensure exact same final pipeline is fitted, one should also set:\n\n\t- 1) brain_add_features_for_new_columns false\n\t- 2) refit_same_best_individual true\n\t- 3) feature_brain_reset_score 'off'\n\t- 4) force_model_restart_to_defaults false\n\n\tThe score will still be reset if the experiment metric chosen changes, but changes to the scored model and features will be more frozen in place." }, { "output": " In some cases, one might have a new dataset but only want to keep same pipeline regardless of new columns, in which case one sets this to False. For example, new data might lead to new dropped features, due to shift or leak detection." }, { "output": " ``force_model_restart_to_defaults``\n\n\n.. dropdown:: Restart-refit use default model settings if model switches\n\t:open:\n\n\tIf restart/refit and no longer have the original model class available, be conservative and go back to defaults for that model class." }, { "output": " ``dump_modelparams_every_scored_indiv``\n~\n\n.. dropdown:: Enable detailed scored model info\n\t:open:\n\n\tWhether to dump every scored individual's model parameters to csv/tabulated/json file produces files." }, { "output": " [txt, csv, json]\n\n.. _fast-approx-trees:\n\n``fast_approx_num_trees``\n~\n\n.. dropdown:: Max number of trees to use for fast approximation\n\t:open:\n\n\tWhen ``fast_approx=True``, specify the maximum number of trees to use." }, { "output": " .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions. .. _fast-approx-one-fold:\n\n``fast_approx_do_one_fold``\n~\n\n.. dropdown:: Whether to use only one fold for fast approximation\n\t:open:\n\n\tWhen ``fast_approx=True``, specify whether to speed up fast approximation further by using only one fold out of all cross-validation folds." }, { "output": " .. 
note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions. .. _fast-approx-one-model:\n\n``fast_approx_do_one_model``\n\n\n.. dropdown:: Whether to use only one model for fast approximation\n\t:open:\n\n\tWhen ``fast_approx=True``, specify whether to speed up fast approximation further by using only one model out of all ensemble models." }, { "output": " .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions. .. _fast-approx-trees-shap:\n\n``fast_approx_contribs_num_trees``\n\n\n.. dropdown:: Maximum number of trees to use for fast approximation when making Shapley predictions\n\t:open:\n\n\tWhen ``fast_approx_contribs=True``, specify the maximum number of trees to use for 'Fast Approximation' in GUI when making Shapley predictions and for AutoDoc/MLI." }, { "output": " .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc. .. _fast-approx-one-fold-shap:\n\n``fast_approx_contribs_do_one_fold``\n\n\n.. dropdown:: Whether to use only one fold for fast approximation when making Shapley predictions\n\t:open:\n\n\tWhen ``fast_approx_contribs=True``, specify whether to speed up ``fast_approx_contribs`` further by using only one fold out of all cross-validation folds for 'Fast Approximation' in GUI when making Shapley predictions and for AutoDoc/MLI." }, { "output": " .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc. .. _fast-approx-one-model-shap:\n\n``fast_approx_contribs_do_one_model``\n~\n\n.. dropdown:: Whether to use only one model for fast approximation when making Shapley predictions\n\t:open:\n\n\tWhen ``fast_approx_contribs=True``, specify whether to speed up ``fast_approx_contribs`` further by using only one model out of all ensemble models for 'Fast Approximation' in GUI when making Shapley predictions and for AutoDoc/MLI." }, { "output": " .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc. .. _autoviz_recommended_transformation:\n\n``autoviz_recommended_transformation``\n\n\n.. dropdown:: Autoviz Recommended Transformations\n\t:open:\n\n\tKey-value pairs of column names and transformations that :ref:`Autoviz ` recommended." }, { "output": " .. _linux-rpms:\n\nLinux RPMs\n\n\nFor Linux machines that will not use the Docker image or DEB, an RPM installation is available for the following environments:\n\n- x86_64 RHEL 7 / RHEL 8\n- CentOS 7 / CentOS 8\n\nThe installation steps assume that you have a license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " - When using systemd, remove the ``dai-minio``, ``dai-h2o``, ``dai-redis``, ``dai-procsy``, and ``dai-vis-server`` services. 
When upgrading, you can use the following commands to deactivate these services:\n\n ::\n\n systemctl stop dai-minio\n systemctl disable dai-minio\n systemctl stop dai-h2o\n systemctl disable dai-h2o\n systemctl stop dai-redis\n systemctl disable dai-redis\n systemctl stop dai-procsy\n systemctl disable dai-procsy\n systemctl stop dai-vis-server\n systemctl disable dai-vis-server\n\nEnvironment\n~\n\n+-+-+\n| Operating System | Min Mem |\n+=+=+\n| RHEL with GPUs | 64 GB |\n+-+-+\n| RHEL with CPUs | 64 GB |\n+-+-+\n| CentOS with GPUS | 64 GB |\n+-+-+\n| CentOS with CPUs | 64 GB |\n+-+-+\n\nRequirements\n\n\n- RedHat 7/RedHat 8/CentOS 7/CentOS 8\n- NVIDIA drivers >= |NVIDIA-driver-ver| recommended (GPU only)." }, { "output": " About the Install\n~\n\n.. include:: linux-rpmdeb-about.frag\n\nInstalling OpenCL\n~\n\nOpenCL is required for full LightGBM support on GPU-powered systems. To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used." }, { "output": " Installing Driverless AI\n\n\nRun the following commands to install the Driverless AI RPM. .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI. sudo rpm -i |VERSION-rpm-lin|\n\n\nNote: For RHEL 7.5, it is necessary to upgrade library glib2:\n\n.. code-block:: bash\n\n sudo yum upgrade glib2\n\nBy default, the Driverless AI processes are owned by the 'dai' user and 'dai' group." }, { "output": " Replace and as appropriate. .. code-block:: bash\n :substitutions:\n\n # Temporarily specify service user and group when installing Driverless AI. # rpm saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files." }, { "output": " Starting Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Start Driverless AI. sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Start Driverless AI." }, { "output": " This command needs to be run every reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\nLooking at Driverless AI log files\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n sudo systemctl status dai-dai\n sudo journalctl -u dai-dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n sudo less /opt/h2oai/dai/log/dai.log\n sudo less /opt/h2oai/dai/log/h2o.log\n sudo less /opt/h2oai/dai/log/procsy.log\n sudo less /opt/h2oai/dai/log/vis-server.log\n\nStopping Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\nUpgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers. For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__ ." 
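}, { "output": " Before upgrading, it can help to confirm the driver level on the host (an optional check; ``nvidia-smi`` ships with the NVIDIA driver):\n\n.. code-block:: bash\n\n nvidia-smi --query-gpu=driver_version --format=csv,noheader"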
}, { "output": " Upgrade Steps\n'\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo systemctl stop dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " sudo rpm -U |VERSION-rpm-lin|\n sudo systemctl daemon-reload\n sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped." }, { "output": " sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time. # Upgrade and restart. sudo rpm -U |VERSION-rpm-lin|\n sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nUninstalling Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall. sudo rpm -e dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify." }, { "output": " sudo rpm -e dai\n\nCAUTION! At this point you can optionally completely remove all remaining files, including the database. (This cannot be undone.) .. code-block:: bash\n\n sudo rm -rf /opt/h2oai/dai\n sudo rm -rf /etc/dai\n\nNote: The UID and GID are not removed during the uninstall process." }, { "output": " .. _linux-deb:\n\nLinux DEBs\n\n\nFor Linux machines that will not use the Docker image or RPM, a deb installation is available for x86_64 Ubuntu 16.04/18.04/20.04/22.04. The following installation steps assume that you have a valid license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " - When using systemd, remove the ``dai-minio``, ``dai-h2o``, ``dai-redis``, ``dai-procsy``, and ``dai-vis-server`` services. When upgrading, you can use the following commands to deactivate these services:\n\n ::\n\n systemctl stop dai-minio\n systemctl disable dai-minio\n systemctl stop dai-h2o\n systemctl disable dai-h2o\n systemctl stop dai-redis\n systemctl disable dai-redis\n systemctl stop dai-procsy\n systemctl disable dai-procsy\n systemctl stop dai-vis-server\n systemctl disable dai-vis-server\n\nEnvironment\n~\n\n+-+-+\n| Operating System | Min Mem |\n+=+=+\n| Ubuntu with GPUs | 64 GB |\n+-+-+\n| Ubuntu with CPUs | 64 GB |\n+-+-+\n\nRequirements\n\n\n- Ubuntu 16.04/Ubuntu 18.04/Ubuntu 20.04/Ubuntu 22.04\n- NVIDIA drivers >= |NVIDIA-driver-ver| is recommended (GPU only)." }, { "output": " About the Install\n~\n\n.. include:: linux-rpmdeb-about.frag\n\nStarting NVIDIA Persistence Mode (GPU only)\n~\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every reboot." }, { "output": " .. include:: enable-persistence.rst\n\nInstalling OpenCL\n~\n\nOpenCL is required for full LightGBM support on GPU-powered systems. To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used." 
}, { "output": " Installing the Driverless AI Linux DEB\n\n\nRun the following commands to install the Driverless AI DEB. .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI. sudo dpkg -i |VERSION-deb-lin|\n\nBy default, the Driverless AI processes are owned by the 'dai' user and 'dai' group." }, { "output": " Replace and as appropriate. .. code-block:: bash\n :substitutions:\n\n # Temporarily specify service user and group when installing Driverless AI. # dpkg saves these for systemd in the /etc/dai/User.conf and /etc/dai/Group.conf files." }, { "output": " Starting Driverless AI\n\n\nTo start Driverless AI, use the following command:\n\n.. code-block:: bash\n\n # Start Driverless AI. sudo systemctl start dai\n\nNote: If you don't have systemd, refer to :ref:`linux-tarsh` for install instructions." }, { "output": " sudo systemctl stop dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped." }, { "output": " sudo ps -u dai\n\n\nUpgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers. For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__ ." }, { "output": " Upgrade Steps\n'\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo systemctl stop dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " sudo dpkg -i |VERSION-deb-lin|\n sudo systemctl daemon-reload\n sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped." }, { "output": " sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time. If you do not, all previous data will be lost. # Upgrade and restart. sudo dpkg -i |VERSION-deb-lin|\n sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nUninstalling Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall Driverless AI. sudo dpkg -r dai\n\n # Purge Driverless AI. sudo dpkg -P dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped." }, { "output": " sudo ps -u dai\n\n # Uninstall Driverless AI. sudo dpkg -r dai\n\n # Purge Driverless AI. sudo dpkg -P dai\n\nCAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot be undone):\n\n.. code-block:: bash\n\n sudo rm -rf /opt/h2oai/dai\n sudo rm -rf /etc/dai\n\nNote: The UID and GID are not removed during the uninstall process." }, { "output": " However, we DO NOT recommend removing the UID and GID if you plan to re-install Driverless AI. If you remove the UID and GID and then reinstall Driverless AI, the UID and GID will likely be re-assigned to a different (unrelated) user/group in the future; this may cause confusion if there are any remaining files on the filesystem referring to the deleted user or group." 
}, { "output": " This problem is caused by the font ``NotoColorEmoji.ttf``, which cannot be processed by the Python matplotlib library. A workaround is to disable the font by renaming it. (Do not use fontconfig because it is ignored by matplotlib.)" }, { "output": " .. _install-on-nvidia-dgx:\n\nInstall on NVIDIA GPU Cloud/NGC Registry\n\n\nDriverless AI is supported on the following NVIDIA DGX products, and the installation steps for each platform are the same. - `NVIDIA GPU Cloud `__\n- `NVIDIA DGX-1 `__\n- `NVIDIA DGX-2 `__\n- `NVIDIA DGX Station `__\n\nEnvironment\n~\n\n+++++\n| Provider | GPUs | Min Memory | Suitable for |\n+++++\n| NVIDIA GPU Cloud | Yes | | Serious use |\n+++++\n| NVIDIA DGX-1/DGX-2 | Yes | 128 GB | Serious use |\n+++++\n| NVIDIA DGX Station | Yes | 64 GB | Serious Use | \n+++++\n\nInstalling the NVIDIA NGC Registry\n\n\nNote: These installation instructions assume that you are running on an NVIDIA DGX machine." }, { "output": " 1. Log in to your NVIDIA GPU Cloud account at https://ngc.nvidia.com/registry. (Note that NVIDIA Compute is no longer supported by NVIDIA.) 2. In the Registry > Partners menu, select h2oai-driverless." }, { "output": " At the bottom of the screen, select one of the H2O Driverless AI tags to retrieve the pull command. .. image:: ../images/ngc_select_tag.png\n :align: center\n\n4. On your NVIDIA DGX machine, open a command prompt and use the specified pull command to retrieve the Driverless AI image." }, { "output": " Set up a directory for the version of Driverless AI on the host machine: \n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n6. Set up the data, log, license, and tmp directories on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the directory associated with the selected version of Driverless AI\n cd |VERSION-dir|\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n7." }, { "output": " The data will be visible inside the Docker container. 8. Enable persistence of the GPU. Note that this only needs to be run once. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Run ``docker images`` to find the new image tag. 10. Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " We recommend ``shm-size=256m`` in docker launch command. But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. 
code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n11." }, { "output": " Upgrading Driverless AI\n~\n\nThe steps for upgrading Driverless AI on an NVIDIA DGX system are similar to the installation steps. .. include:: upgrade-warning.frag\n \nNote: Use Ctrl+C to stop Driverless AI if it is still running." }, { "output": " Your host environment must have CUDA 10.0 or later with NVIDIA drivers >= 440.82 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the host environment." }, { "output": " Upgrade Steps\n'\n\n1. On your NVIDIA DGX machine, create a directory for the new Driverless AI version. 2. Copy the data, log, license, and tmp directories from the previous Driverless AI directory into the new Driverless AI directory." }, { "output": " AWS Role-Based Authentication\n~\n\nIn Driverless AI, it is possible to enable role-based authentication via the `IAM role `__." }, { "output": " AWS IAM Setup\n'\n\n1. Create an IAM role. This IAM role should have a Trust Relationship with Principal Trust Entity set to your Account ID. For example, the trust relationship for Account ID `524466471676` would look like:\n\n .. code-block:: json\n\n\t{\n\t \"Version\": \"2012-10-17\",\n\t \"Statement\": [\n\t {\n\t \"Effect\": \"Allow\",\n\t \"Principal\": {\n\t \"AWS\": \"arn:aws:iam::524466471676:root\"\n\t },\n\t \"Action\": \"sts:AssumeRole\"\n\t }\n\t ]\n\t}\n\n .. image:: ../images/aws_iam_role_create.png\n :align: center\n\n2." }, { "output": " Assign the policy to the user. .. image:: ../images/aws_iam_policy_assign.png\n\n4. Test role switching here: https://signin.aws.amazon.com/switchrole. (Refer to https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_roles.html#troubleshoot_roles_cant-assume-role.)" }, { "output": " Resources\n'\n\n1. Granting a User Permissions to Switch Roles: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_permissions-to-switch.html\n2. Creating a Role to Delegate Permissions to an IAM User: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html\n3." }, { "output": " .. _system-settings:\n\nSystem Settings\n=\n\n.. _exclusive_mode:\n\n``exclusive_mode``\n\n\n.. dropdown:: Exclusive level of access to node resources\n\t:open:\n\n\tThere are three levels of access:\n\n\t\t- safe: this level assumes that there might be another experiment also running on the same node." }, { "output": " - max: this level assumes that there is absolutely nothing else running on the node except the experiment\n\n\tThe default level is \"safe\" and the equivalent config.toml parameter is ``exclusive_mode``. If :ref:`multinode ` is enabled, this option has no effect, unless worker_remote_processors=1, in which case it will still be applied." 
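}, { "output": " As an illustration, the level can also be pinned for all experiments in config.toml (a minimal sketch using one of the values described above):\n\n\t::\n\n\t exclusive_mode = \"max\""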
}, { "output": " Changing the exclusive mode will reset all exclusive mode related options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of exclusive mode rules." }, { "output": " To reset mode behavior, one can switch between 'safe' and the desired mode. This way the new child experiment will use the default system resources for the chosen mode. ``max_cores``\n~\n\n.. dropdown:: Number of Cores to Use\n\t:open:\n\n\tSpecify the number of cores to use per experiment." }, { "output": " Lower values can reduce memory usage but might slow down the experiment. This value defaults to 0(all). One can also set it using the environment variable OMP_NUM_THREADS or OPENBLAS_NUM_THREADS (e.g., in bash: 'export OMP_NUM_THREADS=32' or 'export OPENBLAS_NUM_THREADS=32')\n\n``max_fit_cores``\n~\n\n.. dropdown:: Maximum Number of Cores to Use for Model Fit\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's fit call." }, { "output": " This value defaults to 10. .. _use_dask_cluster:\n\n``use_dask_cluster``\n\n\n.. dropdown:: If full dask cluster is enabled, use full cluster\n\t:open:\n\n\tSpecify whether to use full multinode distributed cluster (True) or single-node dask (False)." }, { "output": " E.g. several DGX nodes can be more efficient, if used one DGX at a time for medium-sized data. The equivalent config.toml parameter is ``use_dask_cluster``. ``max_predict_cores``\n~\n\n.. dropdown:: Maximum Number of Cores to Use for Model Predict\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's predict call." }, { "output": " This value defaults to 0(all). ``max_predict_cores_in_dai``\n\n\n.. dropdown:: Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, AutoDoc\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's transform and predict call when doing operations in the Driverless AI MLI GUI and the Driverless AI R and Python clients." }, { "output": " This value defaults to 4. ``batch_cpu_tuning_max_workers``\n\n\n.. dropdown:: Tuning Workers per Batch for CPU\n\t:open:\n\n\tSpecify the number of workers used in CPU mode for tuning. A value of 0 uses the socket count, while a value of -1 uses all physical cores greater than or equal to 1." }, { "output": " ``cpu_max_workers``\n~\n.. dropdown:: Number of Workers for CPU Training\n\t:open:\n\n\tSpecify the number of workers used in CPU mode for training:\n\n\t- 0: Use socket count (Default)\n\t- -1: Use all physical cores >= 1 that count\n\n.. _num_gpus_per_experiment:\n\n``num_gpus_per_experiment``\n~\n\n.. dropdown:: #GPUs/Experiment\n\t:open:\n\n\tSpecify the number of GPUs to use per experiment." }, { "output": " Must be at least as large as the number of GPUs to use per model (or -1). In multinode context when using dask, this refers to the per-node value. ``min_num_cores_per_gpu``\n~\n.. dropdown:: Num Cores/GPU\n\t:open:\n\n\tSpecify the number of CPU cores per GPU." }, { "output": " This value defaults to 2. .. _num-gpus-per-model:\n\n``num_gpus_per_model``\n\n.. dropdown:: #GPUs/Model\n\t:open:\n\n\tSpecify the number of GPUs to user per model. The equivalent config.toml parameter is ``num_gpus_per_model`` and the default value is 1." }, { "output": " Setting this parameter to -1 means use all GPUs per model. 
In all cases, XGBoost tree and linear models use the number of GPUs specified per model, while LightGBM and TensorFlow revert to using 1 GPU/model and run multiple models on multiple GPUs." }, { "output": " RuleFit uses GPUs for the parts that involve obtaining trees via LightGBM. In a multinode context when using dask, this parameter refers to the per-node value. .. _num-gpus-for-prediction:\n\n``num_gpus_for_prediction``\n~\n\n.. dropdown:: Num." }, { "output": " If ``predict`` or ``transform`` are called in the same process as ``fit``/``fit_transform``, the number of GPUs will match. New processes will use this count for applicable models and transformers. Note that enabling ``tensorflow_nlp_have_gpus_in_production`` will override this setting for relevant TensorFlow NLP transformers." }, { "output": " Note: When GPUs are used, TensorFlow, PyTorch models and transformers, and RAPIDS always predict on GPU. RAPIDS also requires the Driverless AI Python scoring package to be used on GPUs. In a multinode context when using dask, this refers to the per-node value." }, { "output": " If using CUDA_VISIBLE_DEVICES=... to control GPUs (preferred method), gpu_id=0 is the\n\tfirst in that restricted list of devices. For example, if ``CUDA_VISIBLE_DEVICES='4,5'`` then ``gpu_id_start=0`` will refer to device #4." }, { "output": " This is because the underlying algorithms do not support arbitrary GPU ids, only sequential ids, so be sure to set this value correctly to avoid overlap across all experiments by all users. More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation\n\tNote that GPU selection does not wrap, so gpu_id_start + num_gpus_per_model must be less than the number of visible GPUs." }, { "output": " For actual use beyond this value, the system will start to have slowdown issues. The default value is 3. ``max_max_dt_threads_munging``\n\n.. dropdown:: Maximum number of threads for datatable for munging\n\t:open:\n\n\tMaximum number of threads for datatable for munging." }, { "output": " This option is primarily useful for avoiding model building failures due to GPU Out Of Memory (OOM). It is currently applicable to all non-dask XGBoost models (i.e., GLMModel, XGBoostGBMModel, XGBoostDartModel, XGBoostRFModel), during normal fit or when using Optuna." }, { "output": " For example, if XGBoost runs out of GPU memory, this is detected, and (regardless of the setting of skip_model_failures) feature selection is performed using XGBoost on subsets of features. The dataset is progressively reduced by a factor of 2 with more models to cover all features." }, { "output": " Then all sub-models are used to estimate variable importance by absolute information gain, in order to decide which features to include. Finally, a single model with the most important features is built using the feature count that did not lead to OOM." }, { "output": " - Reproducibility is not guaranteed when this option is turned on. Hence, if the user enables reproducibility for the experiment, 'auto' automatically sets this option to 'off'. This is because the condition of running OOM can change for the same experiment seed." }, { "output": " Also see :ref:`reduce_repeats_when_failure ` and :ref:`fraction_anchor_reduce_features_when_failure `\n\n.. _reduce_repeats_when_failure:\n\n``reduce_repeats_when_failure``\n~\n\n..
dropdown:: Number of repeats for models used for feature selection during failure recovery\n\t:open:\n\n\tWith :ref:`allow_reduce_features_when_failure `, this controls how many repeats of sub-models are used for feature selection." }, { "output": " More repeats can lead to higher accuracy. The cost of this option is proportional to the repeat count. The default value is 1. .. _fraction_anchor_reduce_features_when_failure:\n\n``fraction_anchor_reduce_features_when_failure``\n\n\n.. dropdown:: Fraction of features treated as anchor for feature selection during failure recovery\n\t:open:\n\n\tWith :ref:`allow_reduce_features_when_failure `, this controls the fraction of features treated as an anchor that are fixed for all sub-models." }, { "output": " For tuning and evolution, the probability depends upon any prior importance (if present) from other individuals, while final model uses uniform probability for anchor features. The default fraction is 0.1." }, { "output": " See allow_reduce_features_when_failure. ``lightgbm_reduce_on_errors_list``\n\n\n.. dropdown:: Errors From LightGBM That Trigger Reduction of Features\n\t:open:\n\n\tError strings from LightGBM that are used to trigger re-fit on reduced sub-models." }, { "output": " ``num_gpus_per_hyperopt_dask``\n\n\n.. dropdown:: GPUs / HyperOptDask\n\t:open:\n\n\tSpecify the number of GPUs to use per model hyperopt training task. To use all GPUs, set this to -1. For example, when this is set to -1 and there are 4 GPUs available, all of them can be used for the training of a single model across a Dask cluster." }, { "output": " In multinode context, this refers to the per-node value. ``detailed_traces``\n~\n\n.. dropdown:: Enable Detailed Traces\n\t:open:\n\n\tSpecify whether to enable detailed tracing in Driverless AI trace when running an experiment." }, { "output": " ``debug_log``\n~\n\n.. dropdown:: Enable Debug Log Level\n\t:open:\n\n\tIf enabled, the log files will also include debug logs. This is disabled by default. ``log_system_info_per_experiment``\n\n\n.. dropdown:: Enable Logging of System Information for Each Experiment\n\t:open:\n\n\tSpecify whether to include system information such as CPU, GPU, and disk space at the start of each experiment log." }, { "output": " The F0.5 score is the weighted harmonic mean of the precision and recall (given a threshold value). Unlike the F1 score, which gives equal weight to precision and recall, the F0.5 score gives more weight to precision than to recall." }, { "output": " For example, if your use case is to predict which products you will run out of, you may consider False Positives worse than False Negatives. In this case, you want your predictions to be very precise and only capture the products that will definitely run out." }, { "output": " F05 equation:\n\n.. math::\n\n F0.5 = 1.25 \\;\\Big(\\; \\frac{(precision) \\; (recall)}{((0.25) \\; (precision)) + recall}\\; \\Big)\n\nWhere:\n\n- *precision* is the positive observations (true positives) the model correctly identified from all the observations it labeled as positive (the true positives + the false positives)." }, { "output": " S3 Setup\n\n\nDriverless AI lets you explore S3 data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with S3. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." 
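}, { "output": " Before configuring the connector, it can be useful to confirm that the bucket is reachable at all. The following is a quick sanity check with the AWS CLI, not an official setup step; the bucket name and prefix below are placeholders, and it assumes the AWS CLI is installed with valid credentials:\n\n .. code-block:: bash\n\n # List the target bucket; substitute your own bucket and prefix\n aws s3 ls s3://your-bucket/datasets/\n"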
}, { "output": " Description of Configuration Attributes\n~\n\n- ``aws_access_key_id``: The S3 access key ID\n- ``aws_secret_access_key``: The S3 access key\n- ``aws_role_arn``: The Amazon Resource Name\n- ``aws_default_region``: The region to use when the aws_s3_endpoint_url option is not set." }, { "output": " - ``aws_s3_endpoint_url``: The endpoint URL that will be used to access S3. - ``aws_use_ec2_role_credentials``: If set to true, the S3 Connector will try to to obtain credentials associated with the role attached to the EC2 instance." }, { "output": " - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Example 1: Enable S3 with No Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n\tThis example enables the S3 data connector and disables authentication." }, { "output": " This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv. .. code-block:: bash\n\t :substitutions:\n\n\t nvidia-docker run \\\n\t\t\tshm-size=256m \\\n\t\t\tadd-host name.node:172.16.2.186 \\\n\t\t\t-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,s3\" \\\n\t\t\t-p 12345:12345 \\\n\t\t\tinit -it rm \\\n\t\t\t-v /tmp/dtmp/:/tmp \\\n\t\t\t-v /tmp/dlog/:/log \\\n\t\t\t-v /tmp/dlicense/:/license \\\n\t\t\t-v /tmp/ddata/:/data \\\n\t\t\t-u $(id -u):$(id -g) \\\n\t\t\th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n\tThis example shows how to configure S3 options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, s3\"``\n\n\t2. Mount the config.toml file into the Docker container. .. code-block:: bash\n\t \t :substitutions:\n\n\t\t nvidia-docker run \\\n\t\t \tpid=host \\\n\t\t \tinit \\\n\t\t \trm \\\n\t\t \tshm-size=256m \\\n\t\t \tadd-host name.node:172.16.2.186 \\\n\t\t \t-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n\t\t \t-p 12345:12345 \\\n\t\t \t-v /local/path/to/config.toml:/path/in/docker/config.toml \\\n\t\t \t-v /etc/passwd:/etc/passwd:ro \\\n\t\t \t-v /etc/group:/etc/group:ro \\\n\t\t \t-v /tmp/dtmp/:/tmp \\\n\t\t \t-v /tmp/dlog/:/log \\\n\t\t \t-v /tmp/dlicense/:/license \\\n\t\t \t-v /tmp/ddata/:/data \\\n\t\t \t-u $(id -u):$(id -g) \\\n\t\t \th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n\tThis example enables the S3 data connector and disables authentication." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n\t ::\n\n\t # DEB and RPM\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n\t # TAR SH\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n\t2." 
}, { "output": " ::\n\n\t\t# File System Support\n\t\t# upload : standard upload feature\n\t\t# file : local file system/server file system\n\t\t# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n\t\t# dtap : Blue Data Tap file system, remember to configure the DTap section below\n\t\t# s3 : Amazon S3, optionally configure secret and access key below\n\t\t# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n\t\t# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n\t\t# minio : Minio Cloud Storage, remember to configure secret and access key below\n\t\t# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n\t\t# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n\t\t# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n\t\t# jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n\t\t# recipe_url: load custom recipe from URL\n\t\t# recipe_file: load custom recipe from local file system\n\t\tenabled_file_systems = \"file, s3\"\n\n\t3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " It also configures Docker DNS by passing the name and IP of the S3 name node. This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, s3\"``\n\t - ``aws_access_key_id = \"\"``\n\t - ``aws_secret_access_key = \"\"``\n\n\t2." }, { "output": " .. code-block:: bash\n\t \t:substitutions:\n\n\t\t nvidia-docker run \\\n\t\t \tpid=host \\\n\t\t \tinit \\\n\t\t \trm \\\n\t\t \tshm-size=256m \\\n\t\t \tadd-host name.node:172.16.2.186 \\\n\t\t \t-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n\t\t \t-p 12345:12345 \\\n\t\t \t-v /local/path/to/config.toml:/path/in/docker/config.toml \\\n\t\t \t-v /etc/passwd:/etc/passwd:ro \\\n\t\t \t-v /etc/group:/etc/group:ro \\\n\t\t \t-v /tmp/dtmp/:/tmp \\\n\t\t \t-v /tmp/dlog/:/log \\\n\t\t \t-v /tmp/dlicense/:/license \\\n\t\t \t-v /tmp/ddata/:/data \\\n\t\t \t-u $(id -u):$(id -g) \\\n\t\t \th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n\tThis example enables the S3 data connector with authentication by passing an S3 access key ID and an access key." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n\t ::\n\n\t # DEB and RPM\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n\t # TAR SH\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n\t2." 
}, { "output": " ::\n\n\t\t# File System Support\n\t\t# upload : standard upload feature\n\t\t# file : local file system/server file system\n\t\t# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n\t\t# dtap : Blue Data Tap file system, remember to configure the DTap section below\n\t\t# s3 : Amazon S3, optionally configure secret and access key below\n\t\t# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n\t\t# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n\t\t# minio : Minio Cloud Storage, remember to configure secret and access key below\n\t\t# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n\t\t# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n\t\t# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n\t\t# jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n\t\t# recipe_url: load custom recipe from URL\n\t\t# recipe_file: load custom recipe from local file system\n\t\tenabled_file_systems = \"file, s3\"\n\n\t\t# S3 Connector credentials\n\t\taws_access_key_id = \"\"\n\t\taws_secret_access_key = \"\"\n\n\t3." }, { "output": " .. _image-settings:\n\nImage Settings\n\n\n``enable_tensorflow_image``\n~\n.. dropdown:: Enable Image Transformer for Processing of Image Data\n\t:open:\n\n\tSpecify whether to use pretrained deep learning models for processing of image data as part of the feature engineering pipeline." }, { "output": " This is enabled by default. .. _tensorflow_image_pretrained_models:\n\n``tensorflow_image_pretrained_models``\n\n\n.. dropdown:: Supported ImageNet Pretrained Architectures for Image Transformer\n\t:open:\n\n\tSpecify the supported `ImageNet `__ pretrained architectures for image transformer." }, { "output": " If an internet connection is not available, non-default models must be downloaded from http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/dai_image_models_1_10.zip and extracted into ``tensorflow_image_pretrained_models_dir``." }, { "output": " In this case, embeddings from the different architectures are concatenated together (in a single embedding). ``tensorflow_image_vectorization_output_dimension``\n~\n.. dropdown:: Dimensionality of Feature Space Created by Image Transformer\n\t:open:\n\n\tSpecify the dimensionality of the feature (embedding) space created by Image Transformer." }, { "output": " .. _image-model-fine-tune:\n\n``tensorflow_image_fine_tune``\n\n.. dropdown:: Enable Fine-Tuning of the Pretrained Models Used for the Image Transformer\n\t:open:\n\n\tSpecify whether to enable fine-tuning of the ImageNet pretrained models used for the Image Transformer." }, { "output": " ``tensorflow_image_fine_tuning_num_epochs``\n~\n.. dropdown:: Number of Epochs for Fine-Tuning Used for the Image Transformer\n\t:open:\n\n\tSpecify the number of epochs for fine-tuning ImageNet pretrained models used for the Image Transformer." }, { "output": " ``tensorflow_image_augmentations``\n\n.. dropdown:: List of Augmentations for Fine-Tuning Used for the Image Transformer\n\t:open:\n\n\tSpecify the list of possible image augmentations to apply while fine-tuning the ImageNet pretrained models used for the Image Transformer." 
}, { "output": " ``tensorflow_image_batch_size``\n~\n.. dropdown:: Batch Size for the Image Transformer\n\t:open:\n\n\tSpecify the batch size for the Image Transformer. By default, the batch size is set to -1 (selected automatically)." }, { "output": " ``image_download_timeout``\n\n.. dropdown:: Image Download Timeout in Seconds\n\t:open:\n\n\tWhen providing images through URLs, specify the maximum number of seconds to wait for an image to download. This value defaults to 60 sec." }, { "output": " This value defaults to 0.1. ``string_col_as_image_min_valid_types_fraction``\n\n.. dropdown:: Minimum Fraction of Images That Need to Be of Valid Types for Image Column to Be Used\n\t:open:\n\n\tSpecify the fraction of unique image URIs that need to have valid endings (as defined by ``string_col_as_image_valid_types``) for a string column to be considered as image data." }, { "output": " ``tensorflow_image_use_gpu``\n\n.. dropdown:: Enable GPU(s) for Faster Transformations With the Image Transformer\n\t:open:\n\n\tSpecify whether to use any available GPUs to transform images into embeddings with the Image Transformer." }, { "output": " Install on RHEL\n-\n\nThis section describes how to install the Driverless AI Docker image on RHEL. The installation steps vary depending on whether your system has GPUs or if it is CPU only. Environment\n~\n\n+-+-+-+\n| Operating System | GPUs?" }, { "output": " These links describe how to disable automatic updates and specific package updates. This is necessary in order to prevent a mismatch between the NVIDIA driver and the kernel, which can lead to the GPUs failures." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same. .. note::\n\tAs of this writing, Driverless AI has been tested on RHEL versions 7.4, 8.3, and 8.4." }, { "output": " Once you are logged in, perform the following steps. 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/." }, { "output": " .. code-block:: bash\n\n sudo yum install -y yum-utils\n sudo yum-config-manager add-repo https://download.docker.com/linux/centos/docker-ce.repo\n sudo yum makecache fast\n sudo yum -y install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md. .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(." }, { "output": " If you do not run this command, you will have to remember to start the nvidia-docker service manually; otherwise the GPUs will not appear as available. .. code-block:: bash\n\n sudo systemctl enable nvidia-docker\n\n Alternatively, if you have installed Docker CE above you can install nvidia-docker with:\n\n .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/centos7/x86_64/nvidia-docker.repo | \\\n sudo tee /etc/yum.repos.d/nvidia-docker.repo\n sudo yum install nvidia-docker2\n\n4." }, { "output": " If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver. .. code-block:: bash\n\n nvidia-docker run rm nvidia/cuda nvidia-smi\n\n5." 
}, { "output": " Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\n8. Set up the data, log, and license directories on the host machine (within the new directory):\n\n .. code-block:: bash\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n9." }, { "output": " The data will be visible inside the Docker container. 10. Run ``docker images`` to find the image tag. 11. Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " For GPU users, as GPU needs ``pid=host`` for nvml, which makes tini not use pid=1, so it will show the warning message (still harmless). We recommend ``shm-size=256m`` in docker launch command. But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n12." }, { "output": " .. _install-on-rhel-cpus-only:\n\nInstall on RHEL with CPUs\n~\n\nThis section describes how to install and start the Driverless AI Docker image on RHEL. Note that this uses ``docker`` and not ``nvidia-docker``." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same. .. note::\n\tAs of this writing, Driverless AI has been tested on RHEL versions 7.4, 8.3, and 8.4." }, { "output": " Once you are logged in, perform the following steps. 1. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/." }, { "output": " .. code-block:: bash\n\n sudo yum install -y yum-utils\n sudo yum-config-manager add-repo https://download.docker.com/linux/centos/docker-ce.repo\n sudo yum makecache fast\n sudo yum -y install docker-ce\n sudo systemctl start docker\n\n2." }, { "output": " 3. 
Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI Docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " Copy data into the data directory on the host. The data will be visible inside the Docker container at //data. 7. Run ``docker images`` to find the image tag. 8. Start the Driverless AI Docker image." }, { "output": " Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini will print a (harmless) warning message." }, { "output": " But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command. .. code-block:: bash\n :substitutions:\n\n $ docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n9." }, { "output": " HDFS Setup\n\n\nDriverless AI lets you explore HDFS data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with HDFS. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``hdfs_config_path`` (Required): The location the HDFS config folder path. This folder can contain multiple config files. - ``hdfs_auth_type`` (Required): Specifies the HDFS authentication." }, { "output": " - ``keytab``: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos keytab needs to be owned by the DAI user. - ``keytabimpersonation``: Login with impersonation using a keytab." }, { "output": " - ``key_tab_path``: The path of the principal key tab file. This is required when ``hdfs_auth_type='principal'``. - ``hdfs_app_principal_user``: The Kerberos application principal user. This is required when ``hdfs_auth_type='keytab'``." }, { "output": " Separate each argument with spaces. - ``-Djava.security.krb5.conf``\n - ``-Dsun.security.krb5.debug``\n - ``-Dlog4j.configuration``\n\n- ``hdfs_app_classpath``: The HDFS classpath. - ``hdfs_app_supported_schemes``: The list of DFS schemas that is used to check whether a valid input to the connector has been established." }, { "output": " Additional schemas can be supported by adding values that are not selected by default to the list. - ``hdfs://``\n - ``maprfs://``\n - ``swift://``\n\n- ``hdfs_max_files_listed``: Specifies the maximum number of files that are viewable in the connector UI." }, { "output": " To view more files, increase the default value. 
- ``hdfs_init_path``: Specifies the starting HDFS path displayed in the UI of the HDFS browser. - ``enabled_file_systems``: The file systems you want to enable." }, { "output": " Example 1: Enable HDFS with No Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the HDFS data connector and disables HDFS authentication. It does not pass any HDFS configuration file; however it configures Docker DNS by passing the name and IP of the HDFS name node." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs\" \\\n -e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth' \\\n -e DRIVERLESS_AI_PROCSY_PORT=8080 \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed. - ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the HDFS data connector and disables HDFS authentication in the config.toml file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " Note that the procsy port, which defaults to 12347, also has to be changed. ::\n\n # IP address and port of procsy process. 
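# Note: procsy_port defaults to 12347; the HDFS examples in this\n # section change it to 8080.\n 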
procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " If the time difference between clients and DCs are 5 minutes or higher, there will be Kerberos failures. - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user; otherwise Driverless AI will not be able to read/access the Keytab and will result in a fallback to simple authentication and, hence, fail." }, { "output": " - Configures the environment variable ``DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER`` to reference a user for whom the keytab was created (usually in the form of user@realm). .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs\" \\\n -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab' \\\n -e DRIVERLESS_AI_KEY_TAB_PATH='tmp/<>' \\\n -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<>' \\\n -e DRIVERLESS_AI_PROCSY_PORT=8080 \\ \n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed. - ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n - ``hdfs_auth_type = \"keytab\"``\n - ``key_tab_path = \"/tmp/\"``\n - ``hdfs_app_principal_user = \"\"``\n\n 2." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n \n # IP address and port of procsy process. procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n # HDFS connector\n # Auth type can be Principal/keytab/keytabPrincipal\n # Specify HDFS Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with HDFS with a principal user\n # keytab : Authenticate with a Key tab (recommended)\n # keytabimpersonation : Login with impersonation using a keytab\n hdfs_auth_type = \"keytab\"\n\n # Path of the principal key tab file\n key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n hdfs_app_principal_user = \"\"\n\n 3." }, { "output": " Example 3: Enable HDFS with Keytab-Based Impersonation\n\n\nNotes: \n\n- If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server. - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user." }, { "output": " .. tabs::\n .. group-tab:: Docker Image Installs\n\n The example:\n\n - Sets the authentication type to ``keytabimpersonation``. - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs\" \\\n -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation' \\\n -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<>' \\\n -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<>' \\\n -e DRIVERLESS_AI_PROCSY_PORT=8080 \\ \n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example:\n\n - Sets the authentication type to ``keytabimpersonation``." }, { "output": " - Configures the ``hdfs_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm). 1. Configure the Driverless AI config.toml file. Set the following configuration options." }, { "output": " - ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n - ``hdfs_auth_type = \"keytabimpersonation\"``\n - ``key_tab_path = \"/tmp/\"``\n - ``hdfs_app_principal_user = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Sets the authentication type to ``keytabimpersonation``." }, { "output": " - Configures the ``hdfs_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm). 1. Export the Driverless AI config.toml file or add it to ~/.bashrc." }, { "output": " Specify the following configuration options in the config.toml file. ::\n\n # IP address and port of procsy process. procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." 
}, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n # HDFS connector\n # Auth type can be Principal/keytab/keytabPrincipal\n # Specify HDFS Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with HDFS with a principal user\n # keytab : Authenticate with a Key tab (recommended)\n # keytabimpersonation : Login with impersonation using a keytab\n hdfs_auth_type = \"keytabimpersonation\"\n\n # Path of the principal key tab file\n key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n hdfs_app_principal_user = \"\"\n\n 3." }, { "output": " Specifying a Hadoop Platform\n\n\nThe following example shows how to build an H2O-3 Hadoop image and run Driverless AI. This example uses CDH 6.0. Change the ``H2O_TARGET`` to specify a different platform." }, { "output": " Clone and then build H2O-3 for CDH 6.0. .. code-block:: bash\n\n git clone https://github.com/h2oai/h2o-3.git\n cd h2o-3\n ./gradlew clean build -x test\n export H2O_TARGET=cdh6.0\n export BUILD_HADOOP=true\n ./gradlew clean build -x test\n\n2." }, { "output": " .. code-block:: bash\n\n docker run -it rm \\\n -v `pwd`:`pwd` \\\n -w `pwd` \\\n entrypoint bash \\\n network=host \\\n -p 8020:8020 \\\n docker.h2o.ai/cdh-6-w-hive \\\n -c 'sudo -E startup.sh && \\\n source /envs/h2o_env_python3.8/bin/activate && \\\n hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars \"$(cat /opt/hive-jars/hive-libjars)\" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \\\n export CLOUD_IP=localhost && \\\n export CLOUD_PORT=54445 && \\\n make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \\\n bash'\n\n3." }, { "output": " .. _running-docker-on-gce:\n\nInstall and Run in a Docker Container on Google Compute Engine\n\n\nThis section describes how to install and start Driverless AI from scratch using a Docker container in a Google Compute environment." }, { "output": " If you don't have an account, go to https://console.cloud.google.com/getting-started to create one. In addition, refer to Google's `Machine Types documentation `__ for information on Google Compute machine types." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same. Before You Begin\n\n\nIf you are trying GCP for the first time and have just created an account, check your Google Compute Engine (GCE) resource quota limits." }, { "output": " You can change these settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/compute/quotas for more information, including information on how to check your quota and request additional quota." }, { "output": " In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/. 2. In the left navigation panel, select Compute Engine > VM Instances. .. image:: ../images/gce_newvm_instance.png\n :align: center\n :height: 390\n :width: 400\n\n3." }, { "output": " .. image:: ../images/gce_create_instance.png\n :align: center\n\n4. Specify the following at a minimum:\n\n - A unique name for this instance. - The desired `zone `__." }, { "output": " Refer to the following for information on how to add GPUs: https://cloud.google.com/compute/docs/gpus/. - A supported OS, for example Ubuntu 16.04. 
Be sure to also increase the disk size of the OS image to be 64 GB." }, { "output": " This creates the new VM instance. .. image:: ../images/gce_instance_settings.png\n :align: center\n :height: 446\n :width: 380\n\n5. Create a Firewall rule for Driverless AI. On the Google Cloud Platform left navigation panel, select VPC network > Firewall rules." }, { "output": " - Change the Targets dropdown to All instances in the network. - Specify the Source IP ranges to be ``0.0.0.0/0``. - Under Protocols and Ports, select Specified protocols and ports and enter the following: ``tcp:12345``." }, { "output": " .. image:: ../images/gce_create_firewall_rule.png\n :align: center\n :height: 452\n :width: 477\n\n6. On the VM Instances page, SSH to the new VM Instance by selecting Open in Browser Window from the SSH dropdown." }, { "output": " H2O provides a script for you to run in your VM instance. Open an editor in the VM instance (for example, vi). Copy one of the scripts below (depending on whether you are running GPUs or CPUs). Save the script as install.sh." }, { "output": " /etc/os-release;echo $ID$VERSION_ID)\n curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \\\n sudo tee /etc/apt/sources.list.d/nvidia-docker.list\n sudo apt-get update\n\n # Install nvidia-docker2 and reload the Docker daemon configuration\n sudo apt-get install -y nvidia-docker2\n\n .. code-block:: bash\n\n # SCRIPT FOR CPUs ONLY\n apt-get -y update \n apt-get -y no-install-recommends install \\\n curl \\\n apt-utils \\\n python-software-properties \\\n software-properties-common\n\n add-apt-repository -y \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - \n\n apt-get update \n apt-get install -y docker-ce\n\n\n8." }, { "output": " .. code-block:: bash\n\n chmod +x install.sh\n sudo ./install.sh\n\n9. In your user folder, create the following directories as your user. .. code-block:: bash\n\n mkdir ~/tmp\n mkdir ~/log\n mkdir ~/data\n mkdir ~/scripts\n mkdir ~/license\n mkdir ~/demo\n mkdir -p ~/jupyter/notebooks\n\n10." }, { "output": " .. code-block:: bash\n\n sudo usermod -aG docker \n\n\n11. Reboot the system to enable NVIDIA drivers. .. code-block:: bash\n\n sudo reboot\n\n12. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/." }, { "output": " Load the Driverless AI Docker image. The following example shows how to load Driverless AI. Replace VERSION with your image. .. code-block:: bash\n :substitutions:\n\n sudo docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n14." }, { "output": " Otherwise, you must enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " Note: Use ``docker version`` to check which version of Docker you are using. .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. 
code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n16." }, { "output": " You can stop the instance using one of the following methods: \n\nStopping in the browser\n\n1. On the VM Instances page, click on the VM instance that you want to stop. 2. Click Stop at the top of the page." }, { "output": " Azure Blob Store Setup\n \n\nDriverless AI lets you explore Azure Blob Store data sources from within the Driverless AI application. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Supported Data Sources Using the Azure Blob Store Connector\n~\n\nThe following data sources can be used with the Azure Blob Store connector. - :ref:`Azure Blob Storage (general purpose v1)`\n- Blob Storage\n- :ref:`Azure Files (File Storage)`\n- :ref:`Azure Data Lake Storage Gen 2 (Storage V2)`\n\nThe following data sources can be used with the Azure Blob Store connector when also using the HDFS connector." }, { "output": " - ``azure_blob_account_name``: The Microsoft Azure Storage account name. This should be the dns prefix created when the account was created (for example, \"mystorage\"). - ``azure_blob_account_key``: Specify the account key that maps to your account name." }, { "output": " With this option, you can include an override for a host, port, and/or account name. For example, \n\n .. code:: bash\n\n azure_connection_string = \"DefaultEndpointsProtocol=http;AccountName=;AccountKey=;BlobEndpoint=http://:/;\"\n\n- ``azure_blob_init_path``: Specifies the starting Azure Blob store path displayed in the UI of the Azure Blob store browser." }, { "output": " This must be configured in order for data connectors to function properly. The following additional configuration attributes can be used for enabling an HDFS Connector to connect to Azure Data Lake Gen 1 (and optionally with Azure Data Lake Gen 2)." }, { "output": " This folder can contain multiple config files. - ``hdfs_app_classpath``: The HDFS classpath. - ``hdfs_app_supported_schemes``: Supported schemas list is used as an initial check to ensure valid input to connector." }, { "output": " This lets users reference data stored on your Azure storage account using the account name, for example: ``https://mystorage.blob.core.windows.net``. .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,azrbs\" \\\n -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_NAME=\"mystorage\" \\\n -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_KEY=\"\" \\\n -p 12345:12345 \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure Azure Blob Store options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, azrbs\"``\n - ``azure_blob_account_name = \"mystorage\"``\n - ``azure_blob_account_key = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example shows how to enable the Azure Blob Store data connector in the config.toml file when starting Driverless AI in native installs." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, azrbs\"\n\n # Azure Blob Store Connector credentials\n azure_blob_account_name = \"mystorage\"\n azure_blob_account_key = \"\"\n\n 3." }, { "output": " .. 
_example2:\n\nExample 2: Mount Azure File Shares to the Local File System\n~\n\nSupported Data Sources Using the Local File System\n\n\n- Azure Files (File Storage) \n\nMounting Azure File Shares\n\n\nAzure file shares can be mounted into the local file system of Driverless AI." }, { "output": " .. _example3:\n\nExample 3: Enable HDFS Connector to Connect to Azure Data Lake Gen 1\n~\n\nThis example enables the HDFS Connector to connect to Azure Data Lake Gen1. This lets users reference data stored on your Azure Data Lake using the adl URI, for example: ``adl://myadl.azuredatalakestore.net``." }, { "output": " Create an Azure AD web application for service-to-service authentication: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n\n 2." }, { "output": " Take note of the Hadoop Classpath and add the ``azure-datalake-store.jar`` file. This file can be found on any Hadoop version in: ``$HADOOP_HOME/share/hadoop/tools/lib/*``. .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n\n 4." }, { "output": " Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['adl://']\"\n \n 5." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n\n 2. Add the information from your web application to the Hadoop ``core-site.xml`` configuration file:\n\n .. code:: xml\n\n <configuration>\n <property>\n <name>fs.adl.oauth2.access.token.provider.type</name>\n <value>ClientCredential</value>\n </property>\n <property>\n <name>fs.adl.oauth2.refresh.url</name>\n <value>Token endpoint created in step 1.</value>\n </property>\n <property>\n <name>fs.adl.oauth2.client.id</name>\n <value>Client ID created in step 1</value>\n </property>\n <property>\n <name>fs.adl.oauth2.credential</name>\n <value>Client Secret created in step 1</value>\n </property>\n <property>\n <name>fs.defaultFS</name>\n <value>ADL URI</value>\n </property>\n </configuration>\n\n 3." }, { "output": " This file can be found on any Hadoop version in: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n\n 4. Configure the Driverless AI config.toml file." }, { "output": " Save the changes when you are done, then stop/restart Driverless AI. .. _example4:\n\nExample 4: Enable HDFS Connector to Connect to Azure Data Lake Gen 2\n\n\nThis example enables the HDFS Connector to connect to Azure Data Lake Gen2." }, { "output": " .. tabs::\n .. group-tab:: Docker Image with the config.toml\n\n 1. Create an Azure Service Principal: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal\n\n 2." }, { "output": " Add the information from your web application to the Hadoop ``core-site.xml`` configuration file:\n\n ..
code:: xml\n\n <configuration>\n <property>\n <name>fs.azure.account.auth.type</name>\n <value>OAuth</value>\n </property>\n <property>\n <name>fs.azure.account.oauth.provider.type</name>\n <value>org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider</value>\n </property>\n <property>\n <name>fs.azure.account.oauth2.client.endpoint</name>\n <value>Token endpoint created in step 1.</value>\n </property>\n <property>\n <name>fs.azure.account.oauth2.client.id</name>\n <value>Client ID created in step 1</value>\n </property>\n <property>\n <name>fs.azure.account.oauth2.client.secret</name>\n <value>Client Secret created in step 1</value>\n </property>\n </configuration>\n\n 4." }, { "output": " These files can be found on any Hadoop version 3.2 or higher at: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n\n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n \n Note: ABFS is only supported for Hadoop version 3.2 or higher." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['abfs://']\"\n \n 6." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal\n\n 2. Grant permissions to the Service Principal created in step 1 to access blobs: https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad\n\n 3." }, { "output": " Take note of the Hadoop Classpath and add the required jar files. These files can be found on any Hadoop version 3.2 or higher at: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n \n Note: ABFS is only supported for Hadoop version 3.2 or higher \n\n 5." }, { "output": " Set the following configuration options: \n\n .. code:: bash\n \n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['abfs://']\"\n \n 6." }, { "output": " Export MOJO artifact to Azure Blob Storage\n\n\nIn order to export the MOJO artifact to Azure Blob Storage, you must enable support for the shared access signatures (SAS) token. You can enable support for the SAS token by setting the following variables in the ``config.toml`` file:\n\n\n1." }, { "output": " ``artifacts_store=\"azure\"``\n3. ``artifacts_azure_sas_token=\"token\"``\n\nFor instructions on exporting artifacts, see :ref:`export_artifacts`. FAQ\n\n\nCan I connect to my storage account using Private Endpoints?" }, { "output": " Driverless AI can use private endpoints if Driverless AI is located in the allowed VNET. Does Driverless AI support secure transfer? Yes. The Azure Blob Store Connector makes all connections over HTTPS." }, { "output": " .. _recipes-settings:\n\nRecipes Settings\n\n\n.. _included_transformers:\n\n``included_transformers``\n\n\n..
dropdown:: Include Specific Transformers\n\t:open:\n\n\tSelect the :ref:`transformer(s) ` that you want to use in the experiment." }, { "output": " Note: If you uncheck all transformers so that none is selected, Driverless AI will ignore this and will use the default list of transformers for that experiment. This list of transformers will vary for each experiment." }, { "output": " .. _included_models:\n\n``included_models``\n~\n\n.. dropdown:: Include Specific Models\n\t:open:\n\n\tSpecify the types of models that you want Driverless AI to build in the experiment. This list includes natively supported algorithms and models added with custom recipes." }, { "output": " Specifically:\n\n\t - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are ENABLED and the :ref:`sampling_method_for_imbalanced` is ENABLED (set to a value other than off), then Driverless AI will check your target imbalance fraction." }, { "output": " - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are DISABLED and the :ref:`sampling_method_for_imbalanced` option is ENABLED, then no special sampling technique will be performed. - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are ENABLED and the :ref:`sampling_method_for_imbalanced` is DISABLED, sampling will not be used, and these imbalanced models will be disabled." }, { "output": " .. _included_pretransformers:\n\n``included_pretransformers``\n\n\n.. dropdown:: Include Specific Preprocessing Transformers\n\t:open:\n\n\tSpecify which :ref:`transformers ` to use for preprocessing before other transformers are activated." }, { "output": " Notes:\n\n\t- Preprocessing transformers and all other layers of transformers are part of the Python and (if applicable) MOJO scoring packages. - Any :ref:`custom transformer recipe ` or native DAI transformer can be used as a preprocessing transformer." }, { "output": " Caveats:\n\t 1) one cannot currently do a time-series experiment on a time_column that hasn't yet been made (setup of experiment only knows about original data, not transformed). However, one can use a run-time data recipe to (e.g.)" }, { "output": " 2) in order to do a time series experiment with the GUI/client auto-selecting groups, periods, etc. the dataset\n\t must have time column and groups prepared ahead of experiment by user or via a one-time :ref:`data recipe `." }, { "output": " .. _num_pipeline_layers:\n\n``num_pipeline_layers``\n~\n\n.. dropdown:: Number of Pipeline Layers\n\t:open:\n\n\tSpecify the number of pipeline layers. This value defaults to 1. The equivalent config.toml parameter is ``num_pipeline_layers``." }, { "output": " .. _included_datas:\n\n``included_datas``\n\n\n.. dropdown:: Include Specific Data Recipes During Experiment\n\t:open:\n\n\tSpecify whether to include specific data recipes during the experiment. Avoids need for separate data preparation step, builds data preparation within experiment and within python scoring package." }, { "output": " The equivalent config.toml parameter is ``included_datas``. .. _included_individuals:\n\n``included_individuals``\n\n\n.. dropdown:: Include Specific Individuals\n\t:open:\n\n\tIn Driverless AI, every completed experiment automatically generates Python code for the experiment that corresponds to the individual(s) used to build the final model." }, { "output": " This feature gives you code-first access to a significant portion of DAI's internal transformer and model generation process. 
This expert setting lets you do one of the following:\n\n\t- Leave this field empty to have all individuals be freshly generated and treated by DAI's AutoML as a container of model and transformer choices." }, { "output": " If the number of included custom individuals is less than DAI needs, then the remaining individuals are freshly generated. The equivalent config.toml parameter is ``included_individuals``. For more information, see :ref:`individual_recipe`." }, { "output": " Select from the following:\n\n\t- Auto (Default): Use this option to sync the threshold scorer with the scorer used for the experiment. If this is not possible, F1 is used. - F05 More weight on precision, less weight on recall." }, { "output": " - F2: Less weight on precision, more weight on recall. - MCC: Use this option when all classes are equally important. ``prob_add_genes``\n\n\n.. dropdown:: Probability to Add Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to add genes or instances of transformers with specific attributes." }, { "output": " This value defaults to 0.5. ``prob_addbest_genes``\n\n\n.. dropdown:: Probability to Add Best Shared Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to add genes or instances of transformers with specific attributes that have shown to be beneficial to other individuals within the population." }, { "output": " ``prob_prune_genes``\n\n\n.. dropdown:: Probability to Prune Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to prune genes or instances of transformers with specific attributes. This value defaults to 0.5." }, { "output": " This value defaults to 0.25. ``prob_prune_by_features``\n\n\n.. dropdown:: Probability to Prune Weak Features\n\t:open:\n\n\tSpecify the unnormalized probability to prune features that have low variable importance instead of pruning entire instances of genes/transformers." }, { "output": " ``skip_transformer_failures``\n~\n\n.. dropdown:: Whether to Skip Failures of Transformers\n\t:open:\n\n\tSpecify whether to avoid failed transformers. This is enabled by default. ``skip_model_failures``\n~\n\n.. dropdown:: Whether to Skip Failures of Models\n\t:open:\n\n\tSpecify whether to avoid failed models." }, { "output": " This is enabled by default. ``detailed_skip_failure_messages_level``\n\n\n.. dropdown:: Level to Log for Skipped Failures\n\t:open:\n\n\tSpecify one of the following levels for the verbosity of log failure messages for skipped transformers or models:\n\n\t- 0 = Log simple message\n\t- 1 = Log code line plus message (Default)\n\t- 2 = Log detailed stack traces\n\n``notify_failures``\n~\n\n.. dropdown:: Whether to Notify About Failures of Transformers or Models or Other Recipe Failures\n\t:open:\n\n\tSpecify whether to display notifications in the GUI about recipe failures." }, { "output": " The equivalent config.toml parameter is ``notify_failures``. ``acceptance_test_timeout``\n~\n\n.. dropdown:: Timeout in Minutes for Testing Acceptance of Each Recipe\n\t:open:\n\n\tSpecify the number of minutes to wait until a recipe's acceptance testing is aborted." }, { "output": " .. _install-gcp-offering:\n\nInstall the Google Cloud Platform Offering\n\n\nThis section describes how to install and start Driverless AI in a Google Compute environment using the GCP Marketplace. This assumes that you already have a Google Cloud Platform account." 
}, { "output": " Before You Begin\n\n\nIf you are trying GCP for the first time and have just created an account, check your Google Compute Engine (GCE) resource quota limits. By default, GCP allocates a maximum of 8 CPUs and no GPUs." }, { "output": " You can change these settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/compute/quotas for more information, including information on how to check your quota and request additional quota." }, { "output": " In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/. 2. In the left navigation panel, select Marketplace. .. image:: ../images/google_cloud_launcher.png\n :align: center\n :height: 266\n :width: 355\n\n3." }, { "output": " The following page will display. .. image:: ../images/google_driverlessai_offering.png\n :align: center\n\n4. Click Launch on Compute Engine. (If necessary, refer to `Google Compute Instance Types `__ for information about machine and GPU types.)" }, { "output": " (This defaults to 32 CPUs and 120 GB RAM.) - Specify a GPU type. (This defaults to a p100 GPU.) - Optionally change the number of GPUs. (Default is 2.) - Specify the boot disk type and size. - Optionally change the network name and subnetwork names." }, { "output": " - Click Deploy when you are done. Driverless AI will begin deploying. Note that this can take several minutes. .. image:: ../images/google_deploy_compute_engine.png\n :align: center\n\n5. A summary page displays when the compute engine is successfully deployed." }, { "output": " Click on the Instance link to retrieve the external IP address for starting Driverless AI. .. image:: ../images/google_deploy_summary.png\n :align: center\n\n6. In your browser, go to https://[External_IP]:12345 to start Driverless AI." }, { "output": " Agree to the Terms and Conditions. 8. Log in to Driverless AI using your user name and password. 9. Optionally enable GCS and Big Query access. a. In order to enable GCS and Google BigQuery access, you must pass the running instance a service account json file configured with GCS and GBQ access." }, { "output": " Obtain a functioning service account json file from `GCP `__, rename it to \"service_account.json\", and copy it to the Ubuntu user on the running instance." }, { "output": " c. Restart the machine for the changes to take effect. .. code-block:: bash\n\n sudo systemctl stop dai\n\n # Wait for the system to stop\n\n # Verify that the system is no longer running\n sudo systemctl status dai\n\n # Restart the system\n sudo systemctl start dai\n\nUpgrading the Google Cloud Platform Offering\n\n\nPerform the following steps to upgrade the Driverless AI Google Platform offering." }, { "output": " Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade." }, { "output": " .. _time-series-settings:\n\nTime Series Settings\n\n\n.. _time-series-lag-based-recipe:\n\n``time_series_recipe``\n\n.. dropdown:: Time-Series Lag-Based Recipe\n\t:open:\n\n\tThis recipe specifies whether to include Time Series lag features when training a model with a provided (or autodetected) time column." }, { "output": " Lag features are the primary automatically generated time series features and represent a variable's past values. 
At a given sample with time stamp :math:`t`, features at some time difference :math:`T` (lag) in the past are considered." }, { "output": " Lags can be created on any feature as well as on the target. Lagging variables are important in time series because knowing what happened in different time periods in the past can greatly facilitate predictions for the future." }, { "output": " Ensembling is also disabled if a time column is selected or if time column is set to [Auto] on the experiment setup screen. More information about time series lag is available in the :ref:`time-series-use-case` section." }, { "output": " Note that it's possible to rerun another such diverse leaderboard on top of the best-performing model(s), which will effectively help you compose these expert settings. - 'sliding_window': If the forecast horizon is N periods, create a separate model for \"each of the (gap, horizon) pairs of (0,n), (n,n), (2*n,n), ..., (2*N-1, n) in units of time periods." }, { "output": " This can help to improve short-term forecasting quality. ``time_series_leaderboard_periods_per_model``\n~\n.. dropdown:: Number of periods per model if time_series_leaderboard_mode is 'sliding_window'\n\t:open:\n\n\tSpecify the number of periods per model if ``time_series_leaderboard_mode`` is set to ``sliding_window``." }, { "output": " .. _time_series_merge_splits:\n\n``time_series_merge_splits``\n\n.. dropdown:: Larger Validation Splits for Lag-Based Recipe\n\t:open:\n\n\tSpecify whether to create larger validation splits that are not bound to the length of the forecast horizon." }, { "output": " This is enabled by default. ``merge_splits_max_valid_ratio``\n\n.. dropdown:: Maximum Ratio of Training Data Samples Used for Validation\n\t:open:\n\n\tSpecify the maximum ratio of training data samples used for validation across splits when larger validation splits are created (see :ref:`time_series_merge_splits` setting)." }, { "output": " .. _fixed_size_splits:\n\n``fixed_size_splits``\n~\n.. dropdown:: Fixed-Size Train Timespan Across Splits\n\t:open:\n\n\tSpecify whether to keep a fixed-size train timespan across time-based splits during internal validation." }, { "output": " This is disabled by default. ``time_series_validation_fold_split_datetime_boundaries``\n~\n.. dropdown:: Custom Validation Splits for Time-Series Experiments\n\t:open:\n\n\tSpecify date or datetime timestamps (in the same format as the time column) to use for custom training and validation splits." }, { "output": " This value defaults to 30. .. _holiday-calendar:\n\n``holiday_features``\n\n.. dropdown:: Generate Holiday Features\n\t:open:\n\n\tFor time-series experiments, specify whether to generate holiday features for the experiment." }, { "output": " ``holiday_countries``\n~\n.. dropdown:: Country code(s) for holiday features\n\t:open:\n\n\tSpecify country codes in the form of a list that is used to look up holidays. Note: This setting is for migration purposes only." }, { "output": " The lag values provided here are the only set of lags to be explored in the experiment. The following examples show the variety of different methods that can be used to specify override lags:\n\n\t- \"[0]\" disable lags\n\t- \"[7, 14, 21]\" specifies this exact list\n\t- \"21\" specifies every value from 1 to 21\n\t- \"21:3\" specifies every value from 1 to 21 in steps of 3\n\t- \"5-21\" specifies every value from 5 to 21\n\t- \"5-21:3\" specifies every value from 5 to 21 in steps of 3\n\n``override_ufapt_lag_sizes``\n\n.. 
dropdown:: Lags Override for Features That are not Known Ahead of Time\n\t:open:\n\n\tSpecify lags override for non-target features that are not known ahead of time." }, { "output": " - \"[0]\" disable lags\n\t- \"[7, 14, 21]\" specifies this exact list\n\t- \"21\" specifies every value from 1 to 21\n\t- \"21:3\" specifies every value from 1 to 21 in steps of 3\n\t- \"5-21\" specifies every value from 5 to 21\n\t- \"5-21:3\" specifies every value from 5 to 21 in steps of 3\n\n``min_lag_size``\n\n.. dropdown:: Smallest Considered Lag Size\n\t:open:\n\n\tSpecify a minimum considered lag size." }, { "output": " ``allow_time_column_as_feature``\n\n.. dropdown:: Enable Feature Engineering from Time Column\n\t:open:\n\n\tSpecify whether to enable feature engineering based on the selected time column, e.g. Date~weekday." }, { "output": " ``allow_time_column_as_numeric_feature``\n\n.. dropdown:: Allow Integer Time Column as Numeric Feature\n\t:open:\n\n\tSpecify whether to enable feature engineering from an integer time column. Note that if you are using a time series recipe, using a time column (numeric time stamps) as an input feature can lead to a model that memorizes the actual timestamps instead of features that generalize to the future." }, { "output": " ``datetime_funcs``\n\n.. dropdown:: Allowed Date and Date-Time Transformations\n\t:open:\n\n\tSpecify the date or date-time transformations to allow Driverless AI to use. Choose from the following transformers:\n\n\t- year\n\t- quarter\n\t- month\n\t- week\n\t- weekday\n\t- day\n\t- dayofyear\n\t- num (direct numeric value representing the floating point value of time, disabled by default)\n\t- hour\n\t- minute\n\t- second\n\n\tFeatures in Driverless AI will appear as ``get_`` followed by the name of the transformation." }, { "output": " .. _filter_datetime_funcs:\n\n``filter_datetime_funcs``\n~\n.. dropdown:: Auto Filtering of Date and Date-Time Transformations\n\t:open:\n\n\tWhether to automatically filter out date and date-time transformations that would lead to unseen values in the future." }, { "output": " ``allow_tgc_as_features``\n~\n.. dropdown:: Consider Time Groups Columns as Standalone Features\n\t:open:\n\n\tSpecify whether to consider time groups columns as standalone features. This is disabled by default." }, { "output": " If \"Consider time groups columns as standalone features\" is enabled, then specify which TGC feature types to consider as standalone features. Available types are numeric, categorical, ohe_categorical, datetime, date, and text." }, { "output": " Note that \"time_column\" is treated separately via the \"Enable Feature Engineering from Time Column\" option. Also note that if \"Time Series Lag-Based Recipe\" is disabled, then all time group columns are allowed features." }, { "output": " This is set to Auto by default. ``tgc_only_use_all_groups``\n~\n.. dropdown:: Always Group by All Time Groups Columns for Creating Lag Features\n\t:open:\n\n\tSpecify whether to group by all time groups columns for creating lag features, instead of sampling from them." }, { "output": " ``tgc_allow_target_encoding``\n~\n.. dropdown:: Allow Target Encoding of Time Groups Columns\n\t:open:\n\n\tSpecify whether it is allowed to target encode the time groups columns. This is disabled by default." }, { "output": " - Subgroups can be encoded by disabling ``tgc_only_use_all_groups``. ``time_series_holdout_preds``\n~\n.. 
dropdown:: Generate Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to create diagnostic holdout predictions on training data using moving windows." }, { "output": " This can be useful for MLI, but it will slow down the experiment considerably when enabled. Note that the model itself remains unchanged when this setting is enabled. ``time_series_validation_splits``\n~\n.. dropdown:: Number of Time-Based Splits for Internal Model Validation\n\t:open:\n\n\tSpecify a fixed number of time-based splits for internal model validation." }, { "output": " This value defaults to -1 (auto). ``time_series_splits_max_overlap``\n\n.. dropdown:: Maximum Overlap Between Two Time-Based Splits\n\t:open:\n\n\tSpecify the maximum overlap between two time-based splits. The amount of possible splits increases with higher values." }, { "output": " ``time_series_max_holdout_splits``\n\n.. dropdown:: Maximum Number of Splits Used for Creating Final Time-Series Model's Holdout Predictions\n\t:open:\n\n\tSpecify the maximum number of splits used for creating the final time-series Model's holdout predictions." }, { "output": " Use \t``time_series_validation_splits`` to control amount of time-based splits used for model validation. ``mli_ts_fast_approx``\n\n.. dropdown:: Whether to Speed up Calculation of Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to speed up time-series holdout predictions for back-testing on training data." }, { "output": " Note that predictions can be slightly less accurate when this setting is enabled. This is disabled by default. ``mli_ts_fast_approx_contribs``\n~\n.. dropdown:: Whether to Speed up Calculation of Shapley Values for Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to speed up Shapley values for time-series holdout predictions for back-testing on training data." }, { "output": " Note that predictions can be slightly less accurate when this setting is enabled. This is enabled by default. ``mli_ts_holdout_contribs``\n~\n.. dropdown:: Generate Shapley Values for Time-Series Holdout Predictions at the Time of Experiment\n\t:open:\n\n\tSpecify whether to enable the creation of Shapley values for holdout predictions on training data using moving windows at the time of the experiment." }, { "output": " If this setting is disabled, MLI will generate Shapley values on demand. This is enabled by default. ``time_series_min_interpretability``\n\n.. dropdown:: Lower Limit on Interpretability Setting for Time-Series Experiments (Implicitly Enforced)\n\t:open:\n\n\tSpecify the lower limit on interpretability setting for time-series experiments." }, { "output": " To disable this setting, set this value to 1. ``lags_dropout``\n\n.. dropdown:: Dropout Mode for Lag Features\n\t:open:\n\n\tSpecify the dropout mode for lag features in order to achieve an equal n.a. ratio between train and validation/tests." }, { "output": " Dependent mode takes the lag-size dependencies per sample/row into account. Dependent is enabled by default. ``prob_lag_non_targets``\n\n.. dropdown:: Probability to Create Non-Target Lag Features\n\t:open:\n\n\tLags can be created on any feature as well as on the target." }, { "output": " This value defaults to 0.1. .. _rolling-test-set-method:\n\n``rolling_test_method``\n~\n.. dropdown:: Method to Create Rolling Test Set Predictions\n\t:open:\n\n\tSpecify the method used to create rolling test set predictions." }, { "output": " TTA is enabled by default. 
Notes: \n\t\n\t- This setting only applies to the test set that is provided by the user during an experiment. - This setting only has an effect if the provided test set spans more periods than the forecast horizon and if the target values of the test set are known." }, { "output": " This is enabled by default. ``prob_default_lags``\n~\n.. dropdown:: Probability for New Time-Series Transformers to Use Default Lags\n\t:open:\n\n\tSpecify the probability for new lags or the EWMA gene to use default lags." }, { "output": " This value defaults to 0.2. ``prob_lagsinteraction``\n\n.. dropdown:: Probability of Exploring Interaction-Based Lag Transformers\n\t:open:\n\n\tSpecify the unnormalized probability of choosing other lag time-series transformers based on interactions." }, { "output": " ``prob_lagsaggregates``\n~\n.. dropdown:: Probability of Exploring Aggregation-Based Lag Transformers\n\t:open:\n\n\tSpecify the unnormalized probability of choosing other lag time-series transformers based on aggregations." }, { "output": " .. _centering-detrending:\n\n``ts_target_trafo``\n~\n.. dropdown:: Time Series Centering or Detrending Transformation\n\t:open:\n\n\tSpecify whether to use centering or detrending transformation for time series experiments." }, { "output": " Linear or Logistic will remove the fitted linear or logistic trend, Centering will only remove the mean of the target signal and Epidemic will remove the signal specified by a `Susceptible-Infected-Exposed-Recovered-Dead `_ (SEIRD) epidemic model." }, { "output": " Notes:\n\n\t- MOJO support is currently disabled when this setting is enabled. - The Fast centering and linear detrending options use least squares fitting. - The Robust centering and linear detrending options use `random sample consensus `_ (RANSAC) to achieve higher tolerance w.r.t." }, { "output": " - Please see (:ref:`Custom Bounds for SEIRD Epidemic Model Parameters `) for further details on how to customize the bounds of the free SEIRD parameters. .. _seird_parameters:\n\n``ts_target_trafo_epidemic_params_dict``\n\n.. dropdown:: Custom Bounds for SEIRD Epidemic Model Parameters\n\t:open:\n\n\tSpecify the custom bounds for controlling `Susceptible-Infected-Exposed-Recovered-Dead `_ (SEIRD) epidemic model parameters for detrending of the target for each time series group." }, { "output": " For each training split and time series group, the SEIRD model is fit to the target signal by optimizing a set of free parameters for each time series group. The model's value is then subtracted from the training response, and the residuals are passed to the feature engineering and modeling pipeline." }, { "output": " The following is a list of free parameters:\n\n\t- N: Total population, *N = S+E+I+R+D*\n\t- beta: Rate of exposure (*S* -> *E*)\n\t- gamma: Rate of recovering (*I* -> *R*)\n\t- delta: Incubation period\n\t- alpha: Fatality rate\n\t- rho: Rate at which individuals expire\n\t- lockdown: Day of lockdown (-1 => no lockdown)\n\t- beta_decay: Beta decay due to lockdown\n\t- beta_decay_rate: Speed of beta decay\n\n\tProvide upper or lower bounds for each parameter you want to control." }, { "output": " For example:\n\n\t::\n\n\t ts_target_trafo_epidemic_params_dict=\"{'N_min': 1000, 'beta_max': 0.2}\"\n\n\tRefer to https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology and https://arxiv.org/abs/1411.3435 for more information on the SEIRD model." 
}, { "output": " To get the SEIR model, set ``alpha_min=alpha_max=rho_min=rho_max=beta_decay_rate_min=beta_decay_rate_max=0`` and ``lockdown_min=lockdown_max=-1``. ``ts_target_trafo_epidemic_target``\n~\n.. dropdown:: Which SEIRD Model Component the Target Column Corresponds To\n\t:open:\n\n\tSpecify a SEIRD model component for the target column to correspond to." }, { "output": " Select from None (default), Difference, and Ratio. Notes:\n\n\t- MOJO support is currently disabled when this setting is enabled. - The corresponding lag size is specified with the ``ts_target_trafo_lag_size`` expert setting." }, { "output": " .. _install-on-aws:\n\nInstall on AWS\n\n\nDriverless AI can be installed on Amazon AWS using the AWS Marketplace AMI or the AWS Community AMI. .. toctree::\n :maxdepth: 1\n \n choose-AWS\n aws-marketplace-ami\n aws-community-ami\n\nWhen installing via AWS, you can also enable role-based authentication." }, { "output": " Google Cloud Storage Setup\n\n\nDriverless AI lets you explore Google Cloud Storage data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google Cloud Storage." }, { "output": " If you enable GCS or GBP connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication. In order to enable the GCS data connector with authentication, you must:\n\n1." }, { "output": " 2. Mount the JSON file to the Docker instance. 3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json config option. Notes:\n\n- The account JSON includes authentications as provided by the system administrator." }, { "output": " - Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using." }, { "output": " - ``gcs_init_path``: Specifies the starting GCS path displayed in the UI of the GCS browser. Start GCS with Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the GCS data connector with authentication by passing the JSON authentication file." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,gcs\" \\\n -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON=\"/service_account_json.json\" \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n -v `pwd`/service_account_json.json:/service_account_json.json \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure the GCS data connector options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, gcs\"``\n - ``gcs_path_to_service_account_json = \"/service_account_json.json\"`` \n\n 2." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the GCS data connector with authentication by passing the JSON authentication file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, gcs\"\n\n # GCS Connector credentials\n # example (suggested) \"/licenses/my_service_account_json.json\"\n gcs_path_to_service_account_json = \"/service_account_json.json\"\n\n 3." }, { "output": " .. _model-settings:\n\nModel Settings\n\n\n``enable_constant_model``\n~\n.. dropdown:: Constant Models\n\t:open:\n\n\tSpecify whether to enable :ref:`constant models `. This is set to Auto (enabled) by default." }, { "output": " This is set to Auto by default. In this case, Driverless AI will build Decision Tree models if interpretability is greater than or equal to the value of ``decision_tree_interpretability_switch`` (which defaults to 7) and accuracy is less than or equal to ``decision_tree_accuracy_switch`` (which defaults to 7)." }, { "output": " GLMs are very interpretable models with one coefficient per feature, an intercept term and a link function. This is set to Auto by default (enabled if accuracy <= 5 and interpretability >= 6). ``enable_xgboost_gbm``\n\n.. dropdown:: XGBoost GBM Models\n\t:open:\n\n\tSpecify whether to build XGBoost models as part of the experiment (for both the feature engineering part and the final model)." }, { "output": " This is set to Auto by default. In this case, Driverless AI will use XGBoost unless the number of rows * columns is greater than a threshold. 
This threshold is a config setting that is 100M by default for CPU and 30M by default for GPU." }, { "output": " LightGBM Models are the default models. This is set to Auto (enabled) by default. ``enable_xgboost_dart``\n~\n.. dropdown:: XGBoost Dart Models\n\t:open:\n\n\tSpecify whether to use XGBoost's Dart method when building models for experiment (for both the feature engineering part and the final model)." }, { "output": " .. _enable_xgboost_rapids:\n\n``enable_xgboost_rapids``\n~\n.. dropdown:: Enable RAPIDS-cuDF extensions to XGBoost GBM/Dart\n\t:open:\n\n\tSpecify whether to enable RAPIDS extensions to XGBoost GBM/Dart. If selected, python scoring package can only be used on GPU system." }, { "output": " Disabled for dask multinode models due to bug in dask_cudf and xgboost. .. _enable_xgboost_rf:\n\n``enable_xgboost_rf``\n~\n\n.. dropdown:: Enable XGBoost RF model\n\t:open:\n\n\tSpecify whether to enable XGBoost RF mode without early stopping." }, { "output": " .. _enable_xgboost_gbm_dask:\n\n``enable_xgboost_gbm_dask``\n~\n.. dropdown:: Enable Dask_cuDF (multi-GPU) XGBoost GBM\n\t:open:\n\n\tSpecify whether to enable Dask_cudf (multi-GPU) version of XGBoost GBM. Disabled unless switched on." }, { "output": " No Shapley possible. The equivalent config.toml parameter is ``enable_xgboost_gbm_dask`` and the default value is \"auto\". .. _enable_xgboost_dart_dask:\n\n``enable_xgboost_dart_dask``\n\n.. dropdown:: Enable Dask_cuDF (multi-GPU) XGBoost Dart\n\t:open:\n\n\tSpecify whether to enable Dask_cudf (multi-GPU) version of XGBoost GBM/Dart." }, { "output": " Only applicable for single final model without early stopping. No Shapley is possible. The equivalent config.toml parameter is ``enable_xgboost_dart_dask`` and the default value is \"auto\". It is recommended to run Dask_cudf on multi gpus; if for say debugging purposes, user would like to enable them on 1 GPU, then set ``use_dask_for_1_gpu`` to True via config.toml setting." }, { "output": " It is disabled by default unless switched on. The equivalent config.toml parameter is ``enable_lightgbm_dask`` and default value is \"auto\". To enable multinode Dask see :ref:`Dask Multinode Training `." }, { "output": " \"auto\" and \"on\" are same currently. Dask mode for hyperparameter search is enabled if:\n\n\t\t1) Have a :ref:`Dask multinode cluster ` or multi-GPU node and model uses 1 GPU for each model( see :ref:`num-gpus-per-model`)." }, { "output": " The equivalent config.toml parameter is ``enable_hyperopt_dask`` and the default value is \"auto\". .. _num_inner_hyperopt_trials_prefinal:\n\n``num_inner_hyperopt_trials_prefinal``\n\n.. dropdown:: Number of trials for hyperparameter optimization during model tuning only\n\t:open:\n\n\tSpecify the number of trials for Optuna hyperparameter optimization for tuning and evolution of models." }, { "output": " 0 means no trials. For small data, 100 is fine, while for larger data smaller values are reasonable if need results quickly. If using RAPIDS or DASK, hyperparameter optimization stays on GPU the entire time." }, { "output": " Note that, this is useful when there is high overhead of DAI outside inner model fit/predict (i.e the various file, process, and other DAI management processes), so this tunes without that overhead. However, this can overfit on a single fold when doing tuning or evolution, and if using Cross Validation then, averaging the fold hyperparameters can lead to unexpected results." 
}, { "output": " If using RAPIDS or DASK, this is number of trials for rapids-cudf hyperparameter optimization within XGBoost GBM/Dart and LightGBM, and hyperparameter optimization keeps data on GPU entire time. 0 means no trials.For small data, 100 is ok choice, while for larger data smaller values are reasonable if need results quickly." }, { "output": " The equivalent config.toml parameter is ``num_inner_hyperopt_trials_final`` and the default value is 0. ``num_hyperopt_individuals_final``\n\n.. dropdown:: Number of individuals in final ensemble to use Optuna on\n\t:open:\n\n\tNumber of individuals in final model (all folds/repeats for given base model) to optimize with Optuna hyperparameter tuning." }, { "output": " 0 is same as choosing no Optuna trials. Might be only beneficial to optimize hyperparameters of best individual (i.e. value of 1) in ensemble. The default value is -1, means all. The equivalent config.toml parameter is ``num_hyperopt_individuals_final``\n\n``optuna_pruner``\n~\n.. dropdown:: Optuna Pruners\n\t:open:\n\n\t`Optuna Pruner `__ algorithm to use for early stopping of unpromising trials (applicable to XGBoost and LightGBM that support Optuna callbacks)." }, { "output": " To disable choose None. The equivalent config.toml parameter is ``optuna_pruner``\n\n``optuna_sampler``\n\n.. dropdown:: Optuna Samplers\n\t:open:\n\n\t`Optuna Sampler `__ algorithm to use for narrowing down and optimizing the search space (applicable to XGBoost and LightGBM that support Optuna callbacks)." }, { "output": " To disable choose None. The equivalent config.toml parameter is ``optuna_sampler``\n\n``enable_xgboost_hyperopt_callback``\n\n\n.. dropdown:: Enable Optuna XGBoost Pruning callback\n\t:open:\n\n\tSpecify whether to enable Optuna's XGBoost Pruning callback to abort unpromising runs." }, { "output": " This not is enabled when tuning learning rate. The equivalent config.toml parameter is ``enable_xgboost_hyperopt_callback``\n\n``enable_lightgbm_hyperopt_callback``\n~\n.. dropdown:: Enable Optuna LightGBM Pruning callback\n\t:open:\n\n\tSpecify whether to enable Optuna's LightGBM Pruning callback to abort unpromising runs." }, { "output": " This not is enabled when tuning learning rate. The equivalent config.toml parameter is ``enable_lightgbm_hyperopt_callback``\n\n``enable_tensorflow``\n~\n.. dropdown:: TensorFlow Models\n\t:open:\n\n\tSpecify whether to build `TensorFlow `__ models as part of the experiment (usually only for text features engineering and for the final model unless it's used exclusively)." }, { "output": " This is set to Auto by default (not used unless the number of classes is greater than 10). TensorFlow models are not yet supported by Java MOJOs (only Python scoring pipelines and C++ MOJOs are supported)." }, { "output": " By default, this parameter is set to auto i.e Driverless decides internally whether to use the algorithm for the experiment. Set it to *on* to force the experiment to build a GrowNet model. ``enable_ftrl``\n~\n.. dropdown:: FTRL Models\n\t:open:\n\n\tSpecify whether to build Follow the Regularized Leader (FTRL) models as part of the experiment." }, { "output": " FTRL supports binomial and multinomial classification for categorical targets, as well as regression for continuous targets. This is set to Auto (disabled) by default. ``enable_rulefit``\n\n.. dropdown:: RuleFit Models\n\t:open:\n\n\tSpecify whether to build `RuleFit `__ models as part of the experiment." 
}, { "output": " Note that multiclass classification is not yet supported for RuleFit models. Rules are stored to text files in the experiment directory for now. This is set to Auto (disabled) by default. .. _zero-inflated:\n\n``enable_zero_inflated_models``\n~\n.. dropdown:: Zero-Inflated Models\n\t:open:\n\n\tSpecify whether to enable the automatic addition of :ref:`zero-inflated models ` for regression problems with zero-inflated target values that meet certain conditions:\n\n\t::\n\n\t y >= 0, y.std() > y.mean()\")\n\n\tThis is set to Auto by default." }, { "output": " Select one or more of the following:\n\n\t- gbdt: Boosted trees\n\t- rf_early_stopping: Random Forest with early stopping\n\t- rf: Random Forest\n\t- dart: Dropout boosted trees with no early stopping\n\n\tgbdt and rf are both enabled by default." }, { "output": " This is disabled by default. Notes:\n\n\t- Only supported for CPU. - A MOJO is not built when this is enabled. .. _lightgbm_cuda:\n\n``enable_lightgbm_cuda_support``\n\n.. dropdown:: LightGBM CUDA Support\n\t:open:\n\n\tSpecify whether to enable LightGBM CUDA implementation instead of OpenCL." }, { "output": " ``show_constant_model``\n~\n.. dropdown:: Whether to Show Constant Models in Iteration Panel\n\t:open:\n\n\tSpecify whether to show constant models in the iteration panel. This is disabled by default. ``params_tensorflow``\n~\n.. dropdown:: Parameters for TensorFlow\n\t:open:\n\n\tSpecify specific parameters for TensorFlow to override Driverless AI parameters." }, { "output": " Different strategies for using TensorFlow parameters can be viewed `here `__. .. _max-trees-iterations:\n\n``max_nestimators``\n~\n.. dropdown:: Max Number of Trees/Iterations\n\t:open:\n\n\tSpecify the upper limit on the number of trees (GBM) or iterations (GLM)." }, { "output": " Depending on accuracy settings, a fraction of this limit will be used. ``n_estimators_list_no_early_stopping``\n~\n.. dropdown:: n_estimators List to Sample From for Model Mutations for Models That Do Not Use Early Stopping\n\t:open:\n\n\tFor LightGBM, the dart and normal random forest modes do not use early stopping." }, { "output": " ``min_learning_rate_final``\n~\n.. dropdown:: Minimum Learning Rate for Final Ensemble GBM Models\n\t:open:\n\n\tThis value defaults to 0.01. This is the lower limit on learning rate for final ensemble GBM models.In some cases, the maximum number of trees/iterations is insufficient for the final learning rate, which can lead to no early stopping getting triggered and poor final model performance." }, { "output": " ``max_learning_rate_final``\n~\n.. dropdown:: Maximum Learning Rate for Final Ensemble GBM Models\n\t:open:\n\n\tSpecify the maximum (upper limit) learning rate for final ensemble GBM models. This value defaults to 0.05." }, { "output": " This option defaults to 0.2. So by default, Driverless AI will produce no more than 0.2 * 3000 trees/iterations during feature evolution. .. _max_abs_score_delta_train_valid:\n\n``max_abs_score_delta_train_valid``\n~\n.. dropdown:: Max." }, { "output": " Keep in mind that the meaning of this value depends on the chosen scorer and the dataset (i.e., 0.01 for LogLoss is different than 0.01 for MSE). This option is Experimental, and only for expert use to keep model complexity low." }, { "output": " By default this option is disabled. .. _max_rel_score_delta_train_valid:\n\n``max_rel_score_delta_train_valid``\n~\n.. dropdown:: Max. 
relative delta between training and validation scores for tree models\n\t:open:\n\n\tModify early stopping behavior for tree-based models (LightGBM, XGBoostGBM, CatBoost) such that training score (on training data, not holdout) and validation score differ no more than this relative value (i.e., stop adding trees once abs(train_score - valid_score) > max_rel_score_delta_train_valid * abs(train_score))." }, { "output": " This option is Experimental, and only for expert use to keep model complexity low. To disable, set to 0.0. By default this option is disabled. ``min_learning_rate``\n~\n.. dropdown:: Minimum Learning Rate for Feature Engineering GBM Models\n\t:open:\n\n\tSpecify the minimum learning rate for feature engineering GBM models." }, { "output": " ``max_learning_rate``\n~\n.. dropdown:: Max Learning Rate for Tree Models\n\t:open:\n\n\tSpecify the maximum learning rate for tree models during feature engineering. Higher values can speed up feature engineering but can hurt accuracy." }, { "output": " ``max_epochs``\n\n.. dropdown:: Max Number of Epochs for TensorFlow/FTRL\n\t:open:\n\n\tWhen building TensorFlow or FTRL models, specify the maximum number of epochs to train models with (it might stop earlier)." }, { "output": " This option is ignored if TensorFlow models and/or FTRL models is disabled. ``max_max_depth``\n~\n.. dropdown:: Max Tree Depth\n\t:open:\n\n\tSpecify the maximum tree depth. The corresponding maximum value for ``max_leaves`` is double the specified value." }, { "output": " ``max_max_bin``\n~\n.. dropdown:: Max max_bin for Tree Features\n\t:open:\n\n\tSpecify the maximum ``max_bin`` for tree features. This value defaults to 256. ``rulefit_max_num_rules``\n~\n.. dropdown:: Max Number of Rules for RuleFit\n\t:open:\n\n\tSpecify the maximum number of rules to be used for RuleFit models." }, { "output": " .. _ensemble_meta_learner:\n\n``ensemble_meta_learner``\n~\n.. dropdown:: Ensemble Level for Final Modeling Pipeline\n\t:open:\n\n\tModel to combine base model predictions, for experiments that create a final pipeline\n\tconsisting of multiple base models:\n\n\t- blender: Creates a linear blend with non-negative weights that add to 1 (blending) - recommended\n\t- extra_trees: Creates a tree model to non-linearly combine the base models (stacking) - experimental, and recommended to also set enable :ref:`cross_validate_meta_learner`." }, { "output": " (Default)\n\t- 0 = No ensemble, only final single model on validated iteration/tree count. Note that holdout predicted probabilities will not be available. (For more information, refer to this :ref:`FAQ `.)" }, { "output": " .. _cross_validate_meta_learner:\n\n``cross_validate_meta_learner``\n~\n.. dropdown:: Ensemble Level for Final Modeling Pipeline\n\t:open:\n\n\tIf enabled, use cross-validation to create an ensemble for the meta learner itself." }, { "output": " No MOJO will be created if this setting is enabled. Not needed for ensemble_meta_learner='blender'. ``cross_validate_single_final_model``\n~\n.. dropdown:: Cross-Validate Single Final Model\n\t:open:\n\n\tDriverless AI normally produces a single final model for low accuracy settings (typically, less than 5)." }, { "output": " The final pipeline will build :math:`N+1` models, with N-fold cross validation for the single final model. This also creates holdout predictions for all non-time-series experiments with a single final model." }, { "output": " ``parameter_tuning_num_models``\n~\n.. 
dropdown:: Number of Models During Tuning Phase\n\t:open:\n\n\tSpecify the number of models to tune during pre-evolution phase. Specify a lower value to avoid excessive tuning, or specify a higher to perform enhanced tuning." }, { "output": " .. _sampling_method_for_imbalanced:\n\n``imbalance_sampling_method``\n~\n.. dropdown:: Sampling Method for Imbalanced Binary Classification Problems\n\t:open:\n\n\tSpecify the sampling method for imbalanced binary classification problems." }, { "output": " Choose from the following options:\n\n\t- auto: sample both classes as needed, depending on data\n\t- over_under_sampling: over-sample the minority class and under-sample the majority class, depending on data\n\t- under_sampling: under-sample the majority class to reach class balance\n\t- off: do not perform any sampling\n\n\tThis option is closely tied with the Imbalanced Light GBM and Imbalanced XGBoost GBM models, which can be enabled/disabled on the Recipes tab under :ref:`included_models`." }, { "output": " If the target fraction proves to be above the allowed imbalance threshold, then sampling will be triggered. - If this option is ENABLED and the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are DISABLED, then no special sampling technique will be performed." }, { "output": " ``imbalance_sampling_threshold_min_rows_original``\n\n.. dropdown:: Threshold for Minimum Number of Rows in Original Training Data to Allow Imbalanced Sampling\n\t:open:\n\n\tSpecify a threshold for the minimum number of rows in the original training data that allow imbalanced sampling." }, { "output": " ``imbalance_ratio_sampling_threshold``\n\n.. dropdown:: Ratio of Majority to Minority Class for Imbalanced Binary Classification to Trigger Special Sampling Techniques (if Enabled)\n\t:open:\n\n\tFor imbalanced binary classification problems, specify the ratio of majority to minority class." }, { "output": " This value defaults to 5. ``heavy_imbalance_ratio_sampling_threshold``\n\n.. dropdown:: Ratio of Majority to Minority Class for Heavily Imbalanced Binary Classification to Only Enable Special Sampling Techniques (if Enabled)\n\t:open:\n\n\tFor heavily imbalanced binary classification, specify the ratio of the majority to minority class equal and above which to enable only special imbalanced models on the full original data without upfront sampling." }, { "output": " ``imbalance_sampling_number_of_bags``\n~\n.. dropdown:: Number of Bags for Sampling Methods for Imbalanced Binary Classification (if Enabled)\n\t:open:\n\n\tSpecify the number of bags for sampling methods for imbalanced binary classification." }, { "output": " ``imbalance_sampling_max_number_of_bags``\n~\n.. dropdown:: Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification\n\t:open:\n\n\tSpecify the limit on the number of bags for sampling methods for imbalanced binary classification." }, { "output": " ``imbalance_sampling_max_number_of_bags_feature_evolution``\n~\n.. dropdown:: Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification During Feature Evolution Phase\n\t:open:\n\n\tSpecify the limit on the number of bags for sampling methods for imbalanced binary classification." }, { "output": " Note that this setting only applies to shift, leakage, tuning, and feature evolution models. To limit final models, use the Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification setting." 
}, { "output": " This setting controls the approximate number of bags and is only active when the \"Hard limit on number of bags for sampling methods for imbalanced binary classification during feature evolution phase\" option is set to -1." }, { "output": " ``imbalance_sampling_target_minority_fraction``\n~\n.. dropdown:: Target Fraction of Minority Class After Applying Under/Over-Sampling Techniques\n\t:open:\n\n\tSpecify the target fraction of a minority class after applying under/over-sampling techniques." }, { "output": " When starting from an extremely imbalanced original target, it can be advantageous to specify a smaller value such as 0.1 or 0.01. This value defaults to -1. ``ftrl_max_interaction_terms_per_degree``\n~\n.. dropdown:: Max Number of Automatic FTRL Interactions Terms for 2nd, 3rd, 4th order interactions terms (Each)\n\t:open:\n\n\tSamples the number of automatic FTRL interactions terms to no more than this value (for each of 2nd, 3rd, 4th order terms)." }, { "output": " When enabled, this setting provides error bars to validation and test scores based on the standard error of the bootstrap mean. This is enabled by default. ``tensorflow_num_classes_switch``\n~\n.. dropdown:: For Classification Problems with This Many Classes, Default to TensorFlow\n\t:open:\n\n\tSpecify the number of classes above which to use TensorFlow when it is enabled." }, { "output": " (Models set to On, however, are still used.) This value defaults to 10. .. _compute-intervals:\n\n``prediction_intervals``\n\n.. dropdown:: Compute Prediction Intervals\n\t:open:\n\n\tSpecify whether to compute empirical prediction intervals based on holdout predictions." }, { "output": " .. _confidence-level:\n\n``prediction_intervals_alpha``\n\n.. dropdown:: Confidence Level for Prediction Intervals\n\t:open:\n\n\tSpecify a confidence level for prediction intervals. This value defaults to 0.9." }, { "output": " Install the Driverless AI AWS Community AMI\n-\n\nWatch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same." }, { "output": " Log in to your AWS account at https://aws.amazon.com. 2. In the upper right corner of the Amazon Web Services page, set the location drop-down. (Note: We recommend selecting the US East region because H2O's resources are stored there." }, { "output": " .. image:: ../images/ami_location_dropdown.png\n :align: center\n\n\n3. Select the EC2 option under the Compute section to open the EC2 Dashboard. .. image:: ../images/ami_select_ec2.png\n :align: center\n\n4." }, { "output": " .. image:: ../images/ami_launch_instance_button.png\n :align: center\n\n5. Under Community AMIs, search for h2oai, and then select the version that you want to launch. .. image:: ../images/ami_select_h2oai_ami.png\n :align: center\n\n6." }, { "output": " This will ensure that your Driverless AI instance will run on GPUs. Select a GPU compute instance from the available options. (We recommend at least 32 vCPUs.) Click the Next: Configure Instance Details button." }, { "output": " Specify the Instance Details that you want to configure. Create a VPC or use an existing one, and ensure that \"Auto-Assign Public IP\" is enabled and associated to your subnet. Click Next: Add Storage." }, { "output": " Specify the Storage Device settings. Note again that Driverless AI requires 10 GB to run and will stop working of less than 10 GB is available. The machine should have a minimum of 30 GB of disk space." }, { "output": " .. 
image:: ../images/ami_add_storage.png\n :align: center\n\n9. If desired, add unique Tag name to identify your instance. Click Next: Configure Security Group. 10. Add the following security rules to enable SSH access to Driverless AI, then click Review and Launch." }, { "output": " 12. A popup will appear prompting you to select a key pair. This is required in order to SSH into the instance. You can select your existing key pair or create a new one. Be sure to accept the acknowledgement, then click Launch Instances to start the new instance." }, { "output": " Upon successful completion, a message will display informing you that your instance is launching. Click the View Instances button to see information about the instance including the IP address. The Connect button on this page provides information on how to SSH into your instance." }, { "output": " Open a Terminal window and SSH into the IP address of the AWS instance. Replace the DNS name below with your instance DNS. .. code-block:: bash \n\n ssh -i \"mykeypair.pem\" ubuntu@ec2-34-230-6-230.compute-1.amazonaws.com \n\n Note: If you receive a \"Permissions 0644 for \u2018mykeypair.pem\u2019 are too open\" error, run the following command to give the user read permission and remove the other permissions." }, { "output": " If you selected a GPU-compute instance, then you must enable persistence and optimizations of the GPU. The commands vary depending on the instance type. Note also that these commands need to be run once every reboot." }, { "output": " At this point, you can copy data into the data directory on the host machine using ``scp``. For example:\n\n .. code-block:: bash\n\n scp -i /path/mykeypair.pem ubuntu@ec2-34-230-6-230.compute-1.amazonaws.com:/path/to/file/to/be/copied/example.csv /path/of/destination/on/local/machine\n\n where:\n \n * ``i`` is the identify file option\n * ``mykeypair`` is the name of the private keypair file\n * ``ubuntu`` is the name of the private keypair file\n * ``ec2-34-230-6-230.compute-1.amazonaws.com`` is the public DNS name of the instance\n * ``example.csv`` is the file to transfer\n\n17." }, { "output": " Sign in to Driverless AI with the username h2oai and use the AWS InstanceID as the password. You will be prompted to enter your Driverless AI license key when you log in for the first time. .. code-block:: bash\n\n http://Your-Driverless-AI-Host-Machine:12345\n\nStopping the EC2 Instance\n~\n\nThe EC2 instance will continue to run even when you close the aws.amazon.com portal." }, { "output": " On the EC2 Dashboard, click the Running Instances link under the Resources section. 2. Select the instance that you want to stop. 3. In the Actions drop down menu, select Instance State > Stop. 4. A confirmation page will display." }, { "output": " .. _nlp-settings:\n\nNLP Settings\n\n\n``enable_tensorflow_textcnn``\n~\n.. dropdown:: Enable Word-Based CNN TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Word-based CNN TensorFlow models as transformers for NLP." }, { "output": " We recommend that you disable this option on systems that do not use GPUs. ``enable_tensorflow_textbigru``\n~\n.. dropdown:: Enable Word-Based BiGRU TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Word-based BiG-RU TensorFlow models as transformers for NLP." }, { "output": " We recommend that you disable this option on systems that do not use GPUs. ``enable_tensorflow_charcnn``\n~\n.. 
dropdown:: Enable Character-Based CNN TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Character-level CNN TensorFlow models as transformers for NLP." }, { "output": " We recommend that you disable this option on systems that do not use GPUs. ``enable_pytorch_nlp_model``\n\n.. dropdown:: Enable PyTorch Models for NLP\n\t:open:\n\n\tSpecify whether to enable pretrained PyTorch models and fine-tune them for NLP tasks." }, { "output": " You need to set this to On if you want to use the PyTorch models like BERT for modeling. Only the first text column will be used for modeling with these models. We recommend that you disable this option on systems that do not use GPUs." }, { "output": " This is set to Auto by default, and is enabled for text-dominated problems only. You need to set this to On if you want to use the PyTorch models like BERT for feature engineering (via fitting a linear model on top of pretrained embeddings)." }, { "output": " Notes:\n\n\t- This setting requires an Internet connection. ``pytorch_nlp_pretrained_models``\n~\n.. dropdown:: Select Which Pretrained PyTorch NLP Models to Use\n\t:open:\n\n\tSpecify one or more pretrained PyTorch NLP models to use." }, { "output": " - Models that are not selected by default may not have MOJO support. - Using BERT-like models may result in a longer experiment completion time. ``tensorflow_max_epochs_nlp``\n~\n.. dropdown:: Max TensorFlow Epochs for NLP\n\t:open:\n\n\tWhen building TensorFlow NLP features (for text data), specify the maximum number of epochs to train feature engineering models with (it might stop earlier)." }, { "output": " This value defaults to 2 and is ignored if TensorFlow models is disabled. ``enable_tensorflow_nlp_accuracy_switch``\n\n.. dropdown:: Accuracy Above Enable TensorFlow NLP by Default for All Models\n\t:open:\n\n\tSpecify the accuracy threshold." }, { "output": " At lower accuracy, TensorFlow NLP transformations will only be created as a mutation. This value defaults to 5. ``pytorch_nlp_fine_tuning_num_epochs``\n\n.. dropdown:: Number of Epochs for Fine-Tuning of PyTorch NLP Models\n\t:open:\n\n\tSpecify the number of epochs used when fine-tuning PyTorch NLP models." }, { "output": " ``pytorch_nlp_fine_tuning_batch_size``\n\n.. dropdown:: Batch Size for PyTorch NLP Models\n\t:open:\n\n\tSpecify the batch size for PyTorch NLP models. This value defaults to 10. Note: Large models and batch sizes require more memory." }, { "output": " This value defaults to 100. Note: Large models and padding lengths require more memory. ``pytorch_nlp_pretrained_models_dir``\n~\n.. dropdown:: Path to Pretrained PyTorch NLP Models\n\t:open:\n\n\tSpecify a path to pretrained PyTorch NLP models." }, { "output": " Note that this can be either a path in the local file system (``/path/on/server/to/file.txt``) or an S3 location (``s3://``). Notes:\n\n\t- If an S3 location is specified, an S3 access key ID and S3 secret access key can also be specified with the :ref:`tensorflow_nlp_pretrained_s3_access_key_id` and :ref:`tensorflow_nlp_pretrained_s3_secret_access_key` expert settings respectively." }, { "output": " - You can download the fasttext embeddings from `here `__ and specify the local path in this box. - You can also train your own custom embeddings. Please refer to `this code sample `__ for creating custom embeddings that can be passed on to this option." }, { "output": " .. _tensorflow_nlp_pretrained_s3_access_key_id:\n\n``tensorflow_nlp_pretrained_s3_access_key_id``\n\n.. 
dropdown:: S3 access key ID to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location\n\t:open:\n\n\tSpecify an S3 access key ID to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location." }, { "output": " .. _tensorflow_nlp_pretrained_s3_secret_access_key:\n\n``tensorflow_nlp_pretrained_s3_secret_access_key``\n\n.. dropdown:: S3 secret access key to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location\n\t:open:\n\n\tSpecify an S3 secret access key to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location." }, { "output": " ``tensorflow_nlp_pretrained_embeddings_trainable``\n\n.. dropdown:: For TensorFlow NLP, Allow Training of Unfrozen Pretrained Embeddings\n\t:open:\n\n\tSpecify whether to allow training of all weights of the neural network graph, including the pretrained embedding layer weights." }, { "output": " All other weights, however, will still be fine-tuned. This is disabled by default. ``text_fraction_for_text_dominated_problem``\n\n.. dropdown:: Fraction of Text Columns Out of All Features to be Considered a Text-Dominated Problem\n\t:open:\n\n\tSpecify the fraction of text columns out of all features to be considered a text-dominated problem." }, { "output": " Specify when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable. Higher values will favor string columns as categoricals, while lower values will favor string columns as text." }, { "output": " ``text_transformer_fraction_for_text_dominated_problem``\n\n.. dropdown:: Fraction of Text per All Transformers to Trigger That Text Dominated\n\t:open:\n\n\tSpecify the fraction of text transformers out of all transformers for the problem to be considered text-dominated." }, { "output": " ``string_col_as_text_threshold``\n\n.. dropdown:: Threshold for String Columns to be Treated as Text\n\t:open:\n\n\tSpecify the threshold value (from 0 to 1) for string columns to be treated as text (0.0 - text; 1.0 - string)." }, { "output": " ``text_transformers_max_vocabulary_size``\n~\n.. dropdown:: Max Size of the Vocabulary for Text Transformers\n\t:open:\n\n\tSpecify the maximum number of tokens created during the fitting of Tfidf/Count-based text transformers." }, { "output": " .. _quick-start-tables:\n\nQuick-Start Tables by Environment\n-\n\nUse the following tables for Cloud, Server, and Desktop to find the right setup instructions for your environment. 
Cloud\n~\n\nRefer to the following for more information about instance types:\n\n- `AWS Instance Types `__\n- `Azure Instance Types `__\n- `Google Compute Instance Types `__\n\n.. list-table::\n :header-rows: 1\n\n * - Provider\n - Instance Type\n - Num GPUs\n - Suitable for\n - Refer to Section\n * - NVIDIA GPU Cloud\n - \n - \n - Serious use\n - :ref:`install-on-nvidia-dgx`\n * - AWS\n - p2.xlarge\n - 1\n - Experimentation\n - :ref:`install-on-aws`\n * - \n - p2.8xlarge\n - 8\n - Serious use\n - \n * - \n - p2.16xlarge\n - 16\n - Serious use\n - \n * - \n - p3.2xlarge\n - 1\n - Experimentation\n - \n * - \n - p3.8xlarge\n - 4\n - Serious use\n - \n * - \n - p3.16xlarge\n - 8\n - Serious use\n - \n * - \n - g3.4xlarge\n - 1\n - Experimentation\n - \n * - \n - g3.8xlarge\n - 2\n - Experimentation\n - \n * - \n - g3.16xlarge\n - 4\n - Serious use\n - \n * - Azure\n - Standard_NV6\n - 1\n - Experimentation\n - :ref:`install-on-azure`\n * - \n - Standard_NV12\n - 2\n - Experimentation\n - \n * - \n - Standard_NV24\n - 4\n - Serious use\n - \n * - \n - Standard_NC6\n - 1\n - Experimentation\n - \n * - \n - Standard_NC12\n - 2\n - Experimentation\n - \n * - \n - Standard_NC24\n - 4\n - Serious use\n - \n * - Google Compute\n - \n - \n - \n - :ref:`install-on-google-compute`\n\nServer\n\n\n| Operating System | GPUs?" }, { "output": " JDBC Setup\n\n\nDriverless AI lets you explore Java Database Connectivity (JDBC) data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with JDBC." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Tested Databases\n\n\nThe following databases have been tested for minimal functionality. JDBC drivers that are not included in this list may still work with Driverless AI." }, { "output": " See the :ref:`untested-jdbc-driver` section at the end of this chapter for information on how to try out an untested JDBC driver. - Oracle DB\n- PostgreSQL\n- Amazon Redshift\n- Teradata\n\nDescription of Configuration Attributes\n~\n \n- ``jdbc_app_configs``: Configuration for the JDBC connector." }, { "output": " Note: This requires a JSON key (typically the name of the database being configured) to be associated with a nested JSON that contains the ``url``, ``jarpath``, and ``classpath`` fields. In addition, this should take the format:\n\n ::\n\n \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\", \n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n\n For example:\n\n ::\n\n \"\"\"{\n \"postgres\": {\n \"url\": \"jdbc:postgresql://ip address:port/postgres\",\n \"jarpath\": \"/path/to/postgres_driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"\n },\n \"mysql\": {\n \"url\":\"mysql connection string\",\n \"jarpath\": \"/path/to/mysql_driver.jar\",\n \"classpath\": \"my.sql.classpath.Driver\"\n }\n }\"\"\"\n\n Note: The expected input of ``jdbc_app_configs`` is a `JSON string `__." }, { "output": " Depending on how the configuration value is applied, different forms of outer quotations may be required. The following examples show two different methods for applying outer quotations. 
- Configuration value applied with the config.toml file:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}\"\"\"\n\n - Configuration value applied with an environment variable:\n \n ::\n \n DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}'\n \n For example:\n \n ::\n \n DRIVERLESS_AI_JDBC_APP_CONFIGS='{\n \"postgres\": {\"url\": \"jdbc:postgresql://192.xxx.x.xxx:aaaa/name_of_database;user=name_of_user;password=your_password\",\"jarpath\": \"/config/postgresql-xx.x.x.jar\",\"classpath\": \"org.postgresql.Driver\"}, \n \"postgres-local\": {\"url\": \"jdbc:postgresql://123.xxx.xxx.xxx:aaaa/name_of_database\",\"jarpath\": \"/config/postgresql-xx.x.x.jar\",\"classpath\": \"org.postgresql.Driver\"},\n \"ms-sql\": {\"url\": \"jdbc:sqlserver://192.xxx.x.xxx:aaaa;databaseName=name_of_database;user=name_of_user;password=your_password\",\"Username\":\"your_username\",\"password\":\"your_password\",\"jarpath\": \"/config/sqljdbc42.jar\",\"classpath\": \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"},\n \"oracle\": {\"url\": \"jdbc:oracle:thin:@192.xxx.x.xxx:aaaa/orclpdb1\",\"jarpath\": \"ojdbc7.jar\",\"classpath\": \"oracle.jdbc.OracleDriver\"},\n \"db2\": {\"url\": \"jdbc:db2://127.x.x.x:aaaaa/name_of_database\",\"jarpath\": \"db2jcc4.jar\",\"classpath\": \"com.ibm.db2.jcc.DB2Driver\"},\n \"mysql\": {\"url\": \"jdbc:mysql://192.xxx.x.xxx:aaaa;\",\"jarpath\": \"mysql-connector.jar\",\"classpath\": \"com.mysql.jdbc.Driver\"},\n \"Snowflake\": {\"url\": \"jdbc:snowflake://.snowflakecomputing.com/?\",\"jarpath\": \"/config/snowflake-jdbc-x.x.x.jar\",\"classpath\": \"net.snowflake.client.jdbc.SnowflakeDriver\"},\n \"Derby\": {\"url\": \"jdbc:derby://127.x.x.x:aaaa/name_of_database\",\"jarpath\": \"/config/derbyclient.jar\",\"classpath\": \"org.apache.derby.jdbc.ClientDriver\"}\n }'\n\n- ``jdbc_app_jvm_args``: Extra JVM args for the JDBC connector." }, { "output": " - ``jdbc_app_classpath``: Optionally specify an alternative classpath for the JDBC connector. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly." }, { "output": " Download JDBC Driver JAR files:\n\n - `Oracle DB `_\n\n - `PostgreSQL `_\n\n - `Amazon Redshift `_\n\n - `Teradata `_\n\n Note: Remember to take note of the driver classpath, as it is needed for the configuration steps (for example, org.postgresql.Driver)." }, { "output": " Copy the driver JAR to a location that can be mounted into the Docker container. Note: The folder storing the JDBC jar file must be visible/readable by the dai process user. Enable the JDBC Connector\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the JDBC connector for PostgreSQL." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs,jdbc\" \\\n -e DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"postgres\": \n {\"url\": \"jdbc:postgresql://localhost:5432/my_database\", \n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\", \n \"classpath\": \"org.postgresql.Driver\"}}' \\\n -e DRIVERLESS_AI_JDBC_APP_JVM_ARGS=\"-Xmx2g\" \\\n -p 12345:12345 \\\n -v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure JDBC options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n .. code-block:: bash \n\n enabled_file_systems = \"file, upload, jdbc\"\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgresql://localhost:5432/my_database\",\n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the JDBC connector for PostgreSQL." }, { "output": " - The configuration requires a JSON key (typically the name of the database being configured) to be associated with a nested JSON that contains the ``url``, ``jarpath``, and ``classpath`` fields. In addition, this should take the format:\n\n ::\n\n \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\", \n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n\n 1." }, { "output": " For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2."
}, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"upload, file, hdfs, jdbc\"\n\n # Configuration for JDBC Connector." }, { "output": " # Format as a single line without using carriage returns (the following example is formatted for readability). # Use triple quotations to ensure that the text is read as a single string. # Example:\n # \"\"\"{\n # \"postgres\": {\n # \"url\": \"jdbc:postgresql://ip address:port/postgres\",\n # \"jarpath\": \"/path/to/postgres_driver.jar\",\n # \"classpath\": \"org.postgresql.Driver\"\n # },\n # \"mysql\": {\n # \"url\":\"mysql connection string\",\n # \"jarpath\": \"/path/to/mysql_driver.jar\",\n # \"classpath\": \"my.sql.classpath.Driver\"\n # }\n # }\"\"\"\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgres://localhost:5432/my_database\",\n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n # extra jvm args for jdbc connector\n jdbc_app_jvm_args = \"\"\n\n # alternative classpath for jdbc connector\n jdbc_app_classpath = \"\"\n\n 3." }, { "output": " Adding Datasets Using JDBC\n\n\nAfter the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and Drop) drop-down menu. .. figure:: ../images/jdbc.png\n :alt: Make JDBC Query\n :scale: 30%\n\n1." }, { "output": " 2. Select JDBC from the list that appears. 3. Click on the Select JDBC Connection button to select a JDBC configuration. 4. The form will populate with the JDBC Database, URL, Driver, and Jar information." }, { "output": " - JDBC Password: Enter your JDBC password. (See the *Notes* section)\n\n - Destination Name: Enter a name for the new dataset. - (Optional) ID Column Name: Enter a name for the ID column. Specify this field when making large data queries." }, { "output": " Instead, enter the password in the JDBC Password field. The password is entered separately for security purposes. - Due to resource sharing within Driverless AI, the JDBC Connector is only allocated a relatively small amount of memory." }, { "output": " This ensures that the maximum memory allocation is not exceeded. - If a query that is larger than the maximum memory allocation is made without specifying an ID column, the query will not complete successfully." 
}, { "output": " Write a SQL Query in the format of the database that you want to query. (See the `Query Examples <#queryexamples>`__ section below.) The format will vary depending on the database that is used. 6. Click the Click to Make Query button to execute the query." }, { "output": " On a successful query, you will be returned to the datasets page, and the queried data will be available as a new dataset. .. _queryexamples:\n\nQuery Examples\n\n\nThe following are sample configurations and queries for Oracle DB and PostgreSQL:\n\n.. tabs:: \n .. group-tab:: Oracle DB\n\n 1." }, { "output": " Sample Query:\n\n - Select oracledb from the Select JDBC Connection dropdown menu. - JDBC Username: ``oracleuser``\n - JDBC Password: ``oracleuserpassword``\n - ID Column Name:\n - Query:\n\n ::\n\n SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION\n\n Note: Because this query does not specify an ID Column Name, it will only work for small data." }, { "output": " 3. Click the Click to Make Query button to execute the query. .. group-tab:: PostgreSQL \n\n 1. Configuration:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgresql://localhost:5432/postgresdatabase\", \"jarpath\": \"/home/ubuntu/postgres-artifacts/postgres/Driver.jar\", \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n 2." }, { "output": " - JDBC Username: ``postgres_user``\n - JDBC Password: ``pguserpassword``\n - ID Column Name: ``id``\n - Query:\n\n ::\n\n SELECT * FROM loan_level WHERE LOAN_TYPE = 5 (selects all columns from table loan_level with column LOAN_TYPE containing value 5)\n\n 3." }, { "output": " .. _untested-jdbc-driver:\n\nAdding an Untested JDBC Driver\n\n\nWe encourage you to try out JDBC drivers that are not tested in house. .. tabs:: \n .. group-tab:: Docker Image Installs\n\n 1. Download the JDBC jar for your database." }, { "output": " Move your JDBC jar file to a location that DAI can access. 3. Start the Driverless AI Docker image using the JDBC-specific environment variables. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"upload,file,hdfs,s3,recipe_file,jdbc\" \\\n -e DRIVERLESS_AI_JDBC_APP_CONFIGS=\"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\\ \n -e DRIVERLESS_AI_JDBC_APP_JVM_ARGS=\"-Xmx2g\" \\\n -p 12345:12345 \\\n -v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n 1." }, { "output": " 2. Move your JDBC jar file to a location that DAI can access. 3. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n .. code-block:: bash \n\n enabled_file_systems = \"upload, file, hdfs, s3, recipe_file, jdbc\"\n jdbc_app_configs = \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n #Optional arguments\n jdbc_app_jvm_args = \"\"\n jdbc_app_classpath = \"\"\n\n 4." 
}, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " 2. Move your JDBC jar file to a location that DAI can access. 3. Modify the following config.toml settings. Note that these can also be specified as environment variables when starting Driverless AI in Docker:\n\n ::\n\n # enable the JDBC file system\n enabled_file_systems = \"upload, file, hdfs, s3, recipe_file, jdbc\"\n\n # Configure the JDBC Connector." }, { "output": " # Format as a single line without using carriage returns (the following example is formatted for readability). # Use triple quotations to ensure that the text is read as a single string. # Example:\n jdbc_app_configs = \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n\n # optional extra jvm args for jdbc connector\n jdbc_app_jvm_args = \"\"\n\n # optional alternative classpath for jdbc connector\n jdbc_app_classpath = \"\"\n\n 4." }, { "output": " MinIO Setup\n-\n\nThis section provides instructions for configuring Driverless AI to work with `MinIO `__. Note that unlike S3, authentication must also be configured when the MinIO data connector is specified." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``minio_endpoint_url``: The endpoint URL that will be used to access MinIO. - ``minio_access_key_id``: The MinIO access key." }, { "output": " - ``minio_skip_cert_verification``: If this is set to true, then MinIO connector will skip certificate verification. This is set to false by default. - ``enabled_file_systems``: The file systems you want to enable." }, { "output": " Enable MinIO with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the MinIO data connector with authentication by passing an endpoint URL, access key ID, and an access key." }, { "output": " This lets you reference data stored in MinIO directly using the endpoint URL, for example: http:////datasets/iris.csv. .. code-block:: bash\n :substitutions:\n\n \t nvidia-docker run \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,minio\" \\\n -e DRIVERLESS_AI_MINIO_ENDPOINT_URL=\"\"\n -e DRIVERLESS_AI_MINIO_ACCESS_KEY_ID=\"\" \\\n -e DRIVERLESS_AI_MINIO_SECRET_ACCESS_KEY=\"\" \\ \n -e DRIVERLESS_AI_MINIO_SKIP_CERT_VERIFICATION=\"false\" \\\n -p 12345:12345 \\\n init -it rm \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure MinIO options in the config.toml file, and then specify that file when starting Driverless AI in Docker." 
}, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, minio\"``\n - ``minio_endpoint_url = \"\"``\n - ``minio_access_key_id = \"\"``\n - ``minio_secret_access_key = \"\"``\n - ``minio_skip_cert_verification = \"false\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Native Installs\n\n This example enables the MinIO data connector with authentication by passing an endpoint URL, access key ID, and an access key." }, { "output": " This allows users to reference data stored in MinIO directly using the endpoint URL, for example: http:////datasets/iris.csv. 1. Export the Driverless AI config.toml file or add it to ~/.bashrc." }, { "output": " Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : MinIO Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, minio\"\n\n # MinIO Connector credentials\n minio_endpoint_url = \"\"\n minio_access_key_id = \"\"\n minio_secret_access_key = \"\"\n minio_skip_cert_verification = \"false\"\n\n 3." }, { "output": " .. _install-on-azure:\n\nInstall on Azure\n\n\nThis section describes how to install the Driverless AI image from Azure. Note: Prior versions of the Driverless AI installation and upgrade on Azure were done via Docker." }, { "output": " Watch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same." }, { "output": " Log in to your Azure portal at https://portal.azure.com, and click the Create a Resource button. 2. Search for and select H2O DriverlessAI in the Marketplace. .. image:: ../images/azure_select_driverless_ai.png\n :align: center\n\n3." }, { "output": " This launches the H2O DriverlessAI Virtual Machine creation process. .. 
image:: ../images/azure_search_for_dai.png\n :align: center\n\n4. On the Basics tab:\n\n a. Enter a name for the VM. b. Select the Disk Type for the VM." }, { "output": " c. Enter the name that you will use when connecting to the machine through SSH. d. Enter and confirm a password that will be used when connecting to the machine through SSH. e. Specify the Subscription option." }, { "output": " f. Enter a unique name for the resource group. g. Specify the VM region. Click OK when you are done. .. image:: ../images/azure_basics_tab.png\n :align: center\n\n5. On the Size tab, select your virtual machine size." }, { "output": " We recommend using an N-Series type, which comes with a GPU. Also note that Driverless AI requires 10 GB of free space in order to run and will stop working if less than 10 GB is available. We recommend a minimum of 30 GB of disk space." }, { "output": " .. image:: ../images/azure_vm_size.png\n :align: center\n\n6. On the Settings tab, select or create the Virtual Network and Subnet where the VM is going to be located and then click OK.\n\n .. image:: ../images/azure_settings_tab.png\n :align: center\n\n7." }, { "output": " When the validation passes successfully, click Create to create the VM. .. image:: ../images/azure_summary_tab.png\n :align: center\n\n8. After the VM is created, it will be available under the list of Virtual Machines." }, { "output": " 9. Connect to Driverless AI with your browser using the IP address retrieved in the previous step. .. code-block:: bash\n\n http://Your-Driverless-AI-Host-Machine:12345\n\n\nStopping the Azure Instance\n~\n\nThe Azure instance will continue to run even when you close the Azure portal." }, { "output": " Click the Virtual Machines left menu item. 2. Select the checkbox beside your DriverlessAI virtual machine. 3. On the right side of the row, click the ... button, then select Stop. (Note that you can then restart this by selecting Start.)" }, { "output": " \nUpgrading the Driverless AI Community Image\n~\n\n.. include:: upgrade-warning.frag\n\nUpgrading from Version 1.2.2 or Earlier\n'\n\nThe following example shows how to upgrade from 1.2.2 or earlier to the current version." }, { "output": " 1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:\n\n .. code-block:: bash\n\n # Set up a directory of the previous version name\n mkdir dai_rel_1.2.2\n\n # Copy the data, log, license, and tmp directories as backup\n cp -a ./data dai_rel_1.2.2/data\n cp -a ./log dai_rel_1.2.2/log\n cp -a ./license dai_rel_1.2.2/license\n cp -a ./tmp dai_rel_1.2.2/tmp\n\n2." }, { "output": " The command below retrieves version 1.2.2:\n\n .. code-block:: bash\n\n wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.2.2-6/x86_64-centos7/dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz\n\n3." }, { "output": " 4. Use the ``docker load`` command to load the image:\n\n .. code-block:: bash\n\n docker load < dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz\n\n5. Optionally run ``docker images`` to ensure that the new image is in the registry." }, { "output": " Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345. Upgrading from Version 1.3.0 or Later\n\n\nThe following example shows how to upgrade from version 1.3.0. 1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:\n\n .. 
code-block:: bash\n\n # Set up a directory of the previous version name\n mkdir dai_rel_1.3.0\n\n # Copy the data, log, license, and tmp directories as backup\n cp -a ./data dai_rel_1.3.0/data\n cp -a ./log dai_rel_1.3.0/log\n cp -a ./license dai_rel_1.3.0/license\n cp -a ./tmp dai_rel_1.3.0/tmp\n\n2." }, { "output": " Replace VERSION and BUILD below with the Driverless AI version. .. code-block:: bash\n\n wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/VERSION-BUILD/x86_64/dai-ubi8-centos7-x86_64-VERSION.tar.gz\n\n3." }, { "output": " .. _gbq:\n\nGoogle BigQuery Setup\n#####################\n\nDriverless AI lets you explore Google BigQuery (GBQ) data sources from within the Driverless AI application. This page provides instructions for configuring Driverless AI to work with GBQ." }, { "output": " Enabling the GCS and/or GBQ connectors causes those file systems to be displayed in the UI, but the GCS and GBQ connectors cannot be used without first enabling authentication. Before enabling the GBQ data connector with authentication, the following steps must be performed:\n\n1." }, { "output": " To create a private key, click Service Accounts > Keys, and then click the Add Key button. When the Create private key dialog appears, select JSON as the key type. To finish creating the JSON private key and download it to your local file system, click Create." }, { "output": " Mount the downloaded JSON file to the Docker instance. 3. Specify the path to the downloaded and mounted ``auth-key.json`` file with the ``gcs_path_to_service_account_json`` config option. .. note::\n\tDepending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " The following sections describe how to enable the GBQ data connector:\n\n- :ref:`gbq-config-toml`\n- :ref:`gbq-environment-variable`\n- :ref:`gbq-workload-identity`\n\n.. _gbq-config-toml:\n\nEnabling GBQ with the config.toml file\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the GBQ data connector with authentication by passing the JSON authentication file." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,gbq\" \\\n -e DRIVERLESS_AI_GCS_PATH_TO_SERVICE_ACCOUNT_JSON=\"/service_account_json.json\" \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n -v `pwd`/service_account_json.json:/service_account_json.json \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure the GBQ data connector options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, gbq\"``\n - ``gcs_path_to_service_account_json = \"/service_account_json.json\"``\n\n 2." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the GBQ data connector with authentication by passing the JSON authentication file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # file : local file system/server file system\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n enabled_file_systems = \"file, gbq\"\n\n # GCS Connector credentials\n # example (suggested) \"/licenses/my_service_account_json.json\"\n gcs_path_to_service_account_json = \"/service_account_json.json\"\n\n 3." }, { "output": " .. _gbq-environment-variable:\n\nEnabling GBQ by setting an environment variable\n*\n\nThe GBQ data connector can be configured by setting the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as follows:\n\n::\n\n export GOOGLE_APPLICATION_CREDENTIALS=\"SERVICE_ACCOUNT_KEY_PATH\"\n\nIn the preceding example, replace ``SERVICE_ACCOUNT_KEY_PATH`` with the path of the JSON file that contains your service account key." }, { "output": " .. _gbq-workload-identity:\n\nEnabling GBQ by enabling Workload Identity for your GKE cluster\n*\n\nThe GBQ data connector can be configured by enabling Workload Identity for your Google Kubernetes Engine (GKE) cluster." }, { "output": " .. note::\n\tIf Workload Identity is enabled, then the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable does not need to be set. Adding Datasets Using GBQ\n*\n\nAfter Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " .. figure:: ../images/add_dataset_dropdown.png\n :alt: Add Dataset\n :scale: 40\n\nSpecify the following information to add your dataset:\n\n1. Enter BQ Dataset ID with write access to create temporary table: Enter a dataset ID in Google BigQuery that this user has read/write access to." }, { "output": " Note: Driverless AI's connection to GBQ will inherit the top-level directory from the service JSON file. So if a dataset named \"my-dataset\" is in a top-level directory named \"dai-gbq\", then the value for the dataset ID input field would be \"my-dataset\" and not \"dai-gbq:my-dataset\"." }, { "output": " Enter Google Storage destination bucket: Specify the name of Google Cloud Storage destination bucket. Note that the user must have write access to this bucket. 3. Enter Name for Dataset to be saved as: Specify a name for the dataset, for example, ``my_file``." }, { "output": " Enter BigQuery Query (Use StandardSQL): Enter a StandardSQL query that you want BigQuery to execute. For example: ``SELECT * FROM .``. 5. (Optional) Specify a project to use with the GBQ connector." 
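}, { "output": " As a hedged, end-to-end illustration of these fields (the dataset, bucket, and table names below are hypothetical placeholders), a completed form might look like this:\n\n ::\n\n BQ Dataset ID: my_dataset\n Google Storage destination bucket: my-gcs-bucket\n Name for Dataset to be saved as: my_file\n BigQuery Query (StandardSQL): SELECT * FROM my_dataset.my_table\n\n"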
}, { "output": " Linux Docker Images\n-\n\nTo simplify local installation, Driverless AI is provided as a Docker image for the following system combinations:\n\n+-++-+-+\n| Host OS | Docker Version | Host Architecture | Min Mem |\n+=++=+=+\n| Ubuntu 16.04 or later | Docker CE | x86_64 | 64 GB |\n+-++-+-+\n| RHEL or CentOS 7.4 or later | Docker CE | x86_64 | 64 GB |\n+-++-+-+\n| NVIDIA DGX Registry | | x86_64 | |\n+-++-+-+\n\nNote: CUDA 11.2.2 or later with NVIDIA drivers >= |NVIDIA-driver-ver| is recommended (GPU only)." }, { "output": " For the best performance, including GPU support, use nvidia-docker. For a lower-performance experience without GPUs, use regular docker (with the same docker image). These installation steps assume that you have a license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " For GPU users, as GPU needs ``pid=host`` for nvml, which makes tini not use pid=1, so it will show the warning message (still harmless). We recommend ``shm-size=256m`` in docker launch command. But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " \nThis section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These steps ensure that existing experiments are saved. WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded when Driverless AI is upgraded." }, { "output": " - Build MOJO pipelines before upgrading. - Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading. If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view MLI on that model after upgrading." }, { "output": " If that MLI job appears in the list of Interpreted Models in your current version, then it will be retained after upgrading. If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be able to build a MOJO pipeline on that model after upgrading." }, { "output": " Note: Stop Driverless AI if it is still running. Requirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers. For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__ ." }, { "output": " Upgrade Steps\n'\n\n1. SSH into the IP address of the machine that is running Driverless AI. 2. Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n # cd into the new directory\n cd |VERSION-dir|\n\n3." }, { "output": " 4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." 
}, { "output": " Install the Driverless AI AWS Marketplace AMI\n-\n\nA Driverless AI AMI is available in the AWS Marketplace beginning with Driverless AI version 1.5.2. This section describes how to install and run Driverless AI through the AWS Marketplace." }, { "output": " Log in to the `AWS Marketplace `__. 2. Search for Driverless AI. .. figure:: ../images/aws-marketplace-search.png\n :alt: Search for Driverless AI\n\n3. Select the version of Driverless AI that you want to install." }, { "output": " Scroll down to review/edit your region and the selected infrastructure and pricing. .. figure:: ../images/aws-marketplace-pricing-info.png\n :alt: Review pricing \n\n5. Return to the top and select Continue to Subscribe." }, { "output": " 7. If desired, change the Fullfillment Option, Software Version, and Region. Note that this page also includes the AMI ID for the selected software version. Click Continue to Launch when you are done." }, { "output": " Click the Usage Instructions button in AWS to review your Driverless AI username and password. Scroll down to the bottom of the page and click Launch when you are done. .. figure:: ../images/aws-marketplace-launch.png\n :alt: Launch options\n\nYou will receive a \"Success\" message when the image launches successfully." }, { "output": " 1. Navigate to the `EC2 Console `__. 2. Select your instance. 3. Open another browser and launch Driverless AI by navigating to https://:12345." }, { "output": " Sign in to Driverless AI with the username h2oai and use the AWS InstanceID as the password. You will be prompted to enter your Driverless AI license key when you log in for the first time. Stopping the EC2 Instance\n~\n\nThe EC2 instance will continue to run even when you close the aws.amazon.com portal." }, { "output": " On the EC2 Dashboard, click the Running Instances link under the Resources section. 2. Select the instance that you want to stop. 3. In the Actions drop down menu, select Instance State > Stop. 4. A confirmation page will display." }, { "output": " Upgrading the Driverless AI Marketplace Image\n\n\nNote that the first offering of the Driverless AI Marketplace image was 1.5.2. As such, it is only possible to upgrade to versions greater than that. Perform the following steps if you are upgrading to a Driverless AI Marketeplace image version greater than 1.5.2." }, { "output": " Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade." }, { "output": " .. _install-on-google-compute:\n\nInstall on Google Compute\n-\n\nDriverless AI can be installed on Google Compute using one of two methods:\n\n- Install the Google Cloud Platform offering. This installs Driverless AI via the available GCP Marketplace offering." }, { "output": " kdb+ Setup\n\n\nDriverless AI lets you explore `kdb+ `__ data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with kdb+." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``kdb_user``: (Optional) User name \n- ``kdb_password``: (Optional) User's password\n- ``kdb_hostname``: IP address or host of the KDB server\n- ``kdb_port``: Port on which the kdb+ server is listening\n- ``kdb_app_jvm_args``: (Optional) JVM args for kdb+ distributions (for example, ``-Dlog4j.configuration``)." 
}, { "output": " - ``kdb_app_classpath``: (Optional) The kdb+ classpath (or other if the jar file is stored elsewhere). - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly." }, { "output": " The only required flags are the hostname and the port. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,kdb\" \\\n -e DRIVERLESS_AI_KDB_HOSTNAME=\"\" \\\n -e DRIVERLESS_AI_KDB_PORT=\"\" \\\n -p 12345:12345 \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure kdb+ options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, kdb\"``\n - ``kdb_hostname = \"``\n - ``kdb_port = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the kdb+ connector without authentication." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, kdb\"\n\n # KDB Connector credentials\n kdb_hostname = \"\n kdb_port = \"\"\n\n 3." 
}, { "output": " Example 2: Enable kdb+ with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example provides users credentials for accessing a kdb+ server from Driverless AI. .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,kdb\" \\\n -e DRIVERLESS_AI_KDB_HOSTNAME=\"\" \\\n -e DRIVERLESS_AI_KDB_PORT=\"\" \\\n -e DRIVERLESS_AI_KDB_USER=\"\" \\\n -e DRIVERLESS_AI_KDB_PASSWORD=\"\" \\\n -p 12345:12345 \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure kdb+ options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, kdb\"``\n - ``kdb_user = \"\"``\n - ``kdb_password = \"\"``\n - ``kdb_hostname = \"``\n - ``kdb_port = \"\"``\n - ``kdb_app_classpath = \"\"``\n - ``kdb_app_jvm_args = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example provides users credentials for accessing a kdb+ server from Driverless AI." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." 
}, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, kdb\"\n\n # kdb+ Connector credentials\n kdb_user = \"\"\n kdb_password = \"\"\n kdb_hostname = \"\n kdb_port = \"\"\n kdb_app_classpath = \"\"\n kdb_app_jvm_args = \"\"\n\n 3." }, { "output": " Adding Datasets Using kdb+\n\n\nAfter the kdb+ connector is enabled, you can add datasets by selecting kdb+ from the Add Dataset (or Drag and Drop) drop-down menu. .. figure:: ../images/add_dataset_dropdown.png\n :alt: Add Dataset\n :height: 338\n :width: 237\n\nSpecify the following information to add your dataset." }, { "output": " Enter filepath to save query. Enter the local file path for storing your dataset. For example, /home//myfile.csv. Note that this can only be a CSV file. 2. Enter KDB Query: Enter a kdb+ query that you want to execute." }, { "output": " Data Recipe File Setup\n\n\nDriverless AI lets you explore data recipe file data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with local data recipe files." }, { "output": " (Refer to :ref:`modify_by_recipe` for more information.) Notes:\n\n- This connector is enabled by default. These steps are provided in case this connector was previously disabled and you want to re-enable it." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Enable Data Recipe File\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the data recipe file data connector." }, { "output": " Note that ``recipe_file`` is enabled in the config.toml file by default. 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, recipe_file\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the Upload Data Recipe data connector." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." 
}, { "output": " ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " BlueData DataTap Setup\n\n\nThis section provides instructions for configuring Driverless AI to work with BlueData DataTap. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``dtap_auth_type``: Selects DTAP authentication. Available values are:\n\n - ``noauth``: No authentication needed\n - ``principal``: Authenticate with DataTap with a principal user\n - ``keytab``: Authenticate with a Key tab (recommended)." }, { "output": " - ``keytabimpersonation``: Login with impersonation using a keytab\n\n- ``dtap_config_path``: The location of the DTAP (HDFS) config folder path. This folder can contain multiple config files. Note: The DTAP config file core-site.xml needs to contain DTap FS configuration, for example:\n\n ::\n\n \n \n fs.dtap.impl\n com.bluedata.hadoop.bdfs.Bdfs\n The FileSystem for BlueData dtap: URIs.\n \n \n\n- ``dtap_key_tab_path``: The path of the principal key tab file." }, { "output": " - ``dtap_app_principal_user``: The Kerberos app principal user (recommended). - ``dtap_app_login_user``: The user ID of the current user (for example, user@realm). - ``dtap_app_jvm_args``: JVM args for DTap distributions. Separate each argument with spaces. - ``dtap_app_classpath``: The DTap classpath. - ``dtap_init_path``: Specifies the starting DTAP path displayed in the UI of the DTAP browser. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly." }, { "output": " It does not pass any configuration file; however it configures Docker DNS by passing the name and IP of the DTap name node. This lets users reference data stored in DTap directly using the name node address, for example: ``dtap://name.node/datasets/iris.csv`` or ``dtap://name.node/datasets/``. (Note: The trailing slash is currently required for directories.) .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,dtap\" \\\n -e DRIVERLESS_AI_DTAP_AUTH_TYPE='noauth' \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure DataTap options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, dtap\"``\n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the DataTap data connector and disables authentication in the config.toml file." }, { "output": " (Note: The trailing slash is currently required for directories.) 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n enabled_file_systems = \"file, dtap\"\n\n 3." }, { "output": " Example 2: Enable DataTap with Keytab-Based Authentication\n\n\nNotes: \n\n- If using Kerberos Authentication, the the time on the Driverless AI server must be in sync with Kerberos server. If the time difference between clients and DCs are 5 minutes or higher, there will be Kerberos failures. - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user; otherwise Driverless AI will not be able to read/access the Keytab and will result in a fallback to simple authentication and, hence, fail." }, { "output": " - Configures the environment variable ``DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER`` to reference a user for whom the keytab was created (usually in the form of user@realm). .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,dtap\" \\\n -e DRIVERLESS_AI_DTAP_AUTH_TYPE='keytab' \\\n -e DRIVERLESS_AI_DTAP_KEY_TAB_PATH='tmp/<>' \\\n -e DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER='<>' \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. 
group-tab:: Docker Image with the config.toml\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, dtap\"``\n - ``dtap_auth_type = \"keytab\"``\n - ``dtap_key_tab_path = \"/tmp/\"``\n - ``dtap_app_principal_user = \"\"``\n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2." }, { "output": " Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # file : local file system/server file system\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n enabled_file_systems = \"file, dtap\"\n\n # Blue Data DTap connector settings are similar to HDFS connector settings." }, { "output": " If running\n # DAI as a service, then the Kerberos keytab needs to\n # be owned by the DAI user. # keytabimpersonation : Login with impersonation using a keytab\n dtap_auth_type = \"keytab\"\n\n # Path of the principal key tab file\n dtap_key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n dtap_app_principal_user = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user. .. tabs::\n .. group-tab:: Docker Image Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below. - Configures the ``DRIVERLESS_AI_DTAP_APP_PRINCIPAL_USER`` variable, which references a user for whom the keytab was created (usually in the form of user@realm). - Configures the ``DRIVERLESS_AI_DTAP_APP_LOGIN_USER`` variable, which references a user who is being impersonated (usually in the form of user@realm)." }, { "output": " - Configures the ``dtap_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm). - Configures the ``dtap_app_login_user`` variable, which references a user who is being impersonated (usually in the form of user@realm). 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, dtap\"``\n - ``dtap_auth_type = \"keytabimpersonation\"``\n - ``dtap_key_tab_path = \"/tmp/\"``\n - ``dtap_app_principal_user = \"\"``\n - ``dtap_app_login_user = \"\"``\n\n 2." }, { "output": " .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " - Configures the ``dtap_app_login_user`` variable, which references a user who is being impersonated (usually in the form of user@realm). 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." }, { "output": " (jdbc_app_configs)\n # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, dtap\"\n\n # Blue Data DTap connector settings are similar to HDFS connector settings. #\n # Specify DTap Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with DTap with a principal user\n # keytab : Authenticate with a Key tab (recommended)." }, { "output": " Data Recipe URL Setup\n-\n\nDriverless AI lets you explore data recipe URL data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with data recipe URLs. When enabled (default), you will be able to modify datasets that have been added to Driverless AI. (Refer to :ref:`modify_by_recipe` for more information.) Notes:\n\n- This connector is enabled by default. These steps are provided in case this connector was previously disabled and you want to re-enable it." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Enable Data Recipe URL\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the data recipe URL data connector. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file, recipe_url\" \\\n -p 12345:12345 \\\n -it --rm \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to enable the Data Recipe URL data connector in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, recipe_url\"``\n\n 2. Mount the config.toml file into the Docker container. .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the Data Recipe URL data connector." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " AutoDoc Settings\n\n\nThis section includes settings that can be used to configure AutoDoc. ``make_autoreport``\n~\n\n.. dropdown:: Make AutoDoc\n\t:open:\n\n\tSpecify whether to create an AutoDoc for the experiment after it has finished running. This is enabled by default. ``autodoc_report_name``\n~\n\n.. dropdown:: AutoDoc Name\n\t:open:\n\n\tSpecify a name for the AutoDoc report. This is set to \"report\" by default. ``autodoc_template``\n\n\n.. dropdown:: AutoDoc Template Location\n\t:open:\n\n\tSpecify a path for the AutoDoc template:\n\n\t- To generate a custom AutoDoc template, specify the full path to your custom template." }, { "output": " ``autodoc_output_type``\n~\n\n.. dropdown:: AutoDoc File Output Type\n\t:open:\n\n\tSpecify the AutoDoc output type. Choose from the following file types:\n\n\t- docx (Default)\n\t- md\n\n``autodoc_subtemplate_type``\n\n\n.. dropdown:: AutoDoc SubTemplate Type\n\t:open:\n\n\tSpecify the type of sub-templates to use. Choose from the following:\n\n\t- auto (Default)\n\t- md\n\t- docx\n\n``autodoc_max_cm_size``\n~\n\n.. dropdown:: Confusion Matrix Max Number of Classes\n\t:open:\n\n\tSpecify the maximum number of classes in the confusion matrix." }, { "output": " ``autodoc_num_features``\n\n\n.. dropdown:: Number of Top Features to Document\n\t:open:\n\n\tSpecify the number of top features to display in the document. To disable this setting, specify -1. This is set to 50 by default. 
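\tThe AutoDoc settings in this section are named after their ``config.toml`` keys, so they can also be pinned outside the UI. A minimal sketch (the values shown are illustrative, not defaults):\n\n\t::\n\n\t # AutoDoc configuration in config.toml (illustrative values)\n\t make_autoreport = true\n\t autodoc_report_name = \"experiment_report\"\n\t autodoc_output_type = \"md\"\n\t autodoc_num_features = 25\n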
``autodoc_min_relative_importance``\n~\n\n.. dropdown:: Minimum Relative Feature Importance Threshold\n\t:open:\n\n\tSpecify the minimum relative feature importance in order for a feature to be displayed. This value must be a float >= 0 and <= 1. This is set to 0.003 by default. ``autodoc_include_permutation_feature_importance``\n\n\n.. dropdown:: Permutation Feature Importance\n\t:open:\n\n\tSpecify whether to compute permutation-based feature importance." }, { "output": " ``autodoc_feature_importance_num_perm``\n~\n\n.. dropdown:: Number of Permutations for Feature Importance\n\t:open:\n\n\tSpecify the number of permutations to make per feature when computing feature importance. This is set to 1 by default. ``autodoc_feature_importance_scorer``\n~\n\n.. dropdown:: Feature Importance Scorer\n\t:open:\n\n\tSpecify the name of the scorer to be used when calculating feature importance. Leave this setting unspecified to use the default scorer for the experiment. ``autodoc_pd_max_rows``\n~\n\n.. dropdown:: PDP Max Number of Rows\n\t:open:\n\n\tSpecify the number of rows for Partial Dependence Plots." }, { "output": " Set this value to -1 to disable the time limit. This is set to 20 seconds by default. ``autodoc_out_of_range``\n\n\n.. dropdown:: PDP Out of Range\n\t:open:\n\n\tSpecify the number of standard deviations outside of the range of a column to include in partial dependence plots. This shows how the model reacts to data it has not seen before. This is set to 3 by default. ``autodoc_num_rows``\n\n\n.. dropdown:: ICE Number of Rows\n\t:open:\n\n\tSpecify the number of rows to include in PDP and ICE plots if individual rows are not specified." }, { "output": " ``autodoc_population_stability_index``\n\n\n.. dropdown:: Population Stability Index\n\t:open:\n\n\tSpecify whether to include a population stability index if the experiment is a binary classification or regression problem. This is disabled by default. ``autodoc_population_stability_index_n_quantiles``\n\n\n.. dropdown:: Population Stability Index Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for the population stability index. This is set to 10 by default. ``autodoc_prediction_stats``\n\n\n.. dropdown:: Prediction Statistics\n\t:open:\n\n\tSpecify whether to include prediction statistics information if the experiment is a binary classification or regression problem." }, { "output": " ``autodoc_prediction_stats_n_quantiles``\n\n\n.. dropdown:: Prediction Statistics Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for prediction statistics. This is set to 20 by default. ``autodoc_response_rate``\n~\n\n.. dropdown:: Response Rates Plot\n\t:open:\n\n\tSpecify whether to include response rates information if the experiment is a binary classification problem. This is disabled by default. ``autodoc_response_rate_n_quantiles``\n~\n\n.. dropdown:: Response Rates Plot Number of Quantiles\n\t:open:\n\n\tSpecify the number of quantiles to use for response rates information." }, { "output": " ``autodoc_gini_plot``\n~\n\n.. dropdown:: Show GINI Plot\n\t:open:\n\n\tSpecify whether to show the GINI plot. This is disabled by default. ``autodoc_enable_shapley_values``\n~\n\n.. dropdown:: Enable Shapley Values\n\t:open:\n\n\tSpecify whether to show Shapley values results in the AutoDoc. This is enabled by default. ``autodoc_data_summary_col_num``\n\n\n.. dropdown:: Number of Features in Data Summary Table\n\t:open:\n\n\tSpecify the number of features to be shown in the data summary table. 
This value must be an integer." }, { "output": " This is set to -1 by default. ``autodoc_list_all_config_settings``\n\n\n.. dropdown:: List All Config Settings\n\t:open:\n\n\tSpecify whether to show all config settings. If this is disabled, only settings that have been changed are listed. All settings are listed when enabled. This is disabled by default. ``autodoc_keras_summary_line_length``\n~\n\n.. dropdown:: Keras Model Architecture Summary Line Length\n\t:open:\n\n\tSpecify the line length of the Keras model architecture summary. This value must be either an integer greater than 0 or -1." }, { "output": " ``autodoc_transformer_architecture_max_lines``\n\n\n.. dropdown:: NLP/Image Transformer Architecture Max Lines\n\t:open:\n\n\tSpecify the maximum number of lines shown for advanced transformer architecture in the Feature section. Note that the full architecture can be found in the appendix. ``autodoc_full_architecture_in_appendix``\n~\n\n.. dropdown:: Appendix NLP/Image Transformer Architecture\n\t:open:\n\n\tSpecify whether to show the full NLP/Image transformer architecture in the appendix. This is disabled by default." }, { "output": " This is disabled by default. ``autodoc_coef_table_num_models``\n~\n\n.. dropdown:: GLM Coefficient Tables Number of Models\n\t:open:\n\n\tSpecify the number of models for which a GLM coefficients table is shown in the AutoDoc. This value must be -1 or an integer >= 1. Set this value to -1 to show tables for all models. This is set to 1 by default. ``autodoc_coef_table_num_folds``\n\n\n.. dropdown:: GLM Coefficient Tables Number of Folds Per Model\n\t:open:\n\n\tSpecify the number of folds per model for which a GLM coefficients table is shown in the AutoDoc." }, { "output": " ``autodoc_coef_table_num_coef``\n~\n\n.. dropdown:: GLM Coefficient Tables Number of Coefficients\n\t:open:\n\n\tSpecify the number of coefficients to show within a GLM coefficients table in the AutoDoc. This is set to 50 by default. Set this value to -1 to show all coefficients. ``autodoc_coef_table_num_classes``\n\n\n.. dropdown:: GLM Coefficient Tables Number of Classes\n\t:open:\n\n\tSpecify the number of classes to show within a GLM coefficients table in the AutoDoc. Set this value to -1 to show all classes." }, { "output": " Snowflake Setup\n- \n\nDriverless AI allows you to explore Snowflake data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Snowflake. This setup requires you to enable authentication. If you enable Snowflake connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n~\n\n- ``snowflake_account``: The Snowflake account ID\n- ``snowflake_user``: The username for accessing the Snowflake account\n- ``snowflake_password``: The password for accessing the Snowflake account\n- ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Enable Snowflake with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the Snowflake data connector with authentication by passing the ``account``, ``user``, and ``password`` variables." 
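}, { "output": " A sketch of the corresponding Docker command is shown below; it follows the pattern of the other connectors in this section, with ``DRIVERLESS_AI_SNOWFLAKE_*`` environment variables mirroring the config.toml keys above (the account, user, and password values are placeholders). .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,snow\" \\\n -e DRIVERLESS_AI_SNOWFLAKE_ACCOUNT=\"<account_id>\" \\\n -e DRIVERLESS_AI_SNOWFLAKE_USER=\"<username>\" \\\n -e DRIVERLESS_AI_SNOWFLAKE_PASSWORD=\"<password>\" \\\n -p 12345:12345 \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag| "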
}, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, snow\"``\n - ``snowflake_account = \"\"``\n - ``snowflake_user = \"\"``\n - ``snowflake_password = \"\"``\n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the Snowflake data connector with authentication by passing the ``account``, ``user``, and ``password`` variables." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, snow\"\n\n # Snowflake Connector credentials\n snowflake_account = \"\"\n snowflake_user = \"\"\n snowflake_password = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI. Adding Datasets Using Snowflake\n \n\nAfter the Snowflake connector is enabled, you can add datasets by selecting Snowflake from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " 1. Enter Database: Specify the name of the Snowflake database that you are querying. 2. Enter Warehouse: Specify the name of the Snowflake warehouse that you are querying. 3. Enter Schema: Specify the schema of the dataset that you are querying. 4. Enter Name for Dataset to Be Saved As: Specify a name for the dataset to be saved as. Note that this can only be a CSV file (for example, myfile.csv). 5. Enter Username: (Optional) Specify the username associated with this Snowflake account. 
This can be left blank if ``snowflake_user`` was specified in the config.toml when starting Driverless AI; otherwise, this field is required." }, { "output": " Enter Password: (Optional) Specify the password associated with this Snowflake account. This can be left blank if ``snowflake_password`` was specified in the config.toml when starting Driverless AI; otherwise, this field is required. 7. Enter Role: (Optional) Specify your role as designated within Snowflake. See https://docs.snowflake.net/manuals/user-guide/security-access-control-overview.html for more information. 8. Enter Region: (Optional) Specify the region of the warehouse that you are querying." }, { "output": " This is optional and can also be left blank if ``snowflake_url`` was specified with a ```` in the config.toml when starting Driverless AI. 9. Enter File Formatting Parameters: (Optional) Specify any additional parameters for formatting your datasets. Available parameters are listed in https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#type-csv. (Note: Use only parameters for ``TYPE = CSV``.) For example, if your dataset includes a text column that contains commas, you can specify a different delimiter using ``FIELD_DELIMITER='character'``." }, { "output": " For example, you might specify the following to load the \"AMAZON_REVIEWS\" dataset:\n\n * Database: UTIL_DB\n * Warehouse: DAI_SNOWFLAKE_TEST\n * Schema: AMAZON_REVIEWS_SCHEMA\n * Query: SELECT * FROM AMAZON_REVIEWS\n * Enter File Formatting Parameters (Optional): FIELD_OPTIONALLY_ENCLOSED_BY = '\"' \n\n In the above example, if the ``FIELD_OPTIONALLY_ENCLOSED_BY`` option is not set, the following row will result in a failure to import the dataset (as the dataset's delimiter is ``,`` by default):\n\n ::\n \n positive, 2012-05-03,Wonderful\\, tasty taffy,0,0,3,5,2012,Thu,0\n\n Note: Numeric columns from Snowflake that have NULL values are sometimes converted to strings (for example, `\\\\ \\\\N`)." }, { "output": " .. _install-on-windows:\n\nWindows 10\n\n\nThis section describes how to install, start, stop, and upgrade Driverless AI on a Windows 10 machine. The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://h2o.ai/o/try-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " Notes:\n\n- GPU support is not available on Windows. - Scoring is not available on Windows. Caution: Installing Driverless AI on Windows 10 is not recommended for serious use. Environment\n~\n\n+-+-+-+-+\n| Operating System | GPU Support? | Min Mem | Suitable for |\n+=+=+=+=+\n| Windows 10 Pro | No | 16 GB | Experimentation |\n+-+-+-+-+\n| Windows 10 Enterprise | No | 16 GB | Experimentation |\n+-+-+-+-+\n| Windows 10 Education | No | 16 GB | Experimentation |\n+-+-+-+-+\n\nNote: Driverless AI cannot be installed on versions of Windows 10 that do not support Hyper-V." }, { "output": " Docker Image Installation\n~\n\nNotes: \n\n- Be aware that there are known issues with Docker for Windows. More information is available here: https://github.com/docker/for-win/issues/188. 
- Consult with your Windows System Admin if \n\n - Your corporate environment does not allow third-party software installs\n - You are running Windows Defender\n - Your machine is not running with ``Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux``. Watch the installation video `here `__." }, { "output": " Requirements\n'\n\n- Windows 10 Pro / Enterprise / Education\n- Docker Desktop for Windows 2.2.0.3 (42716)\n\nNote: As of this writing, Driverless AI has only been tested on Docker Desktop for Windows version 2.2.0.3 (42716). Installation Procedure\n\n\n1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Download, install, and run Docker for Windows from https://docs.docker.com/docker-for-windows/install/. You can verify that Docker is running by typing ``docker version`` in a terminal (such as Windows PowerShell)." }, { "output": " 3. Before running Driverless AI, you must:\n\n - Enable shared access to the C drive. Driverless AI will not be able to see your local data if this is not set. - Adjust the amount of memory given to Docker to be at least 10 GB. Driverless AI won\u2019t run at all with less than 10 GB of memory. - Optionally adjust the number of CPUs given to Docker. You can adjust these settings by clicking on the Docker whale in your taskbar (look for hidden tasks, if necessary), then selecting Settings > Shared Drive and Settings > Advanced as shown in the following screenshots." }, { "output": " (Docker will restart.) Note that if you cannot make changes, stop Docker and then start Docker again by right clicking on the Docker icon on your desktop and selecting Run as Administrator. .. image:: ../images/windows_docker_menu_bar.png\n :align: center\n :width: 252\n :height: 262\n\n\\\n\n .. image:: ../images/windows_shared_drive_access.png\n :align: center\n :scale: 40%\n\n\\\n\n .. image:: ../images/windows_docker_advanced_preferences.png\n :align: center\n :width: 502\n :height: 326\n\n4." }, { "output": " With Docker running, navigate to the location of your downloaded Driverless AI image. Move the downloaded Driverless AI image to your new directory. 6. Change directories to the new directory, then load the image using the following command:\n\n .. code-block:: bash\n :substitutions:\n \n cd |VERSION-dir|\n docker load -i .\\dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7. Set up the data, log, license, and tmp directories (within the new directory). .. code-block:: bash\n\n md data\n md log\n md license\n md tmp\n\n8." }, { "output": " The data will be visible inside the Docker container at /data. 9. Run ``docker images`` to find the image tag. 10. Start the Driverless AI Docker image. Be sure to replace ``path_to_`` below with the entire path to the location of the folders that you created (for example, \"c:/Users/user-name/driverlessai_folder/data\"). Note that this is regular Docker, not NVIDIA Docker. GPU support will not be available. Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker; if both are enabled in the launch command, tini prints a (harmless) warning message." }, { "output": " Add Custom Recipes\n\n\nCustom recipes are Python code snippets that can be uploaded into Driverless AI at runtime like plugins. Restarting Driverless AI is not required. If you do not have a custom recipe, you can select from a number of recipes available in the `Recipes for H2O Driverless AI repository `_. 
For more information and examples, refer to :ref:`custom-recipes`. To add a custom recipe to Driverless AI, click Add Custom Recipe and select one of the following options:\n\n- From computer: Add a custom recipe as a Python or ZIP file from your local file system." }, { "output": " - From Bitbucket: Add a custom recipe from a Bitbucket repository. To use this option, your Bitbucket username and password must be provided along with the custom recipe Bitbucket URL. Official Recipes (Open Source)\n\n\nTo access `H2O's official recipes repository `_, click Official Recipes (Open Source). .. _edit-toml:\n\nEditing the TOML Configuration\n\n\nTo open the built-in TOML configuration editor, click TOML in the :ref:`expert-settings` window. If you change the default value of an expert setting from the Expert Settings window, that change is displayed in the TOML configuration editor." }, { "output": " The TOML configuration editor lets you manually add, remove, or edit expert setting parameters. To confirm your changes, click Save. The experiment preview updates to reflect your specified configuration changes. For a full list of available settings, see :ref:`expert-settings`. .. note::\n\tDo not edit the section below the ``[recipe_activation]`` line. This section provides Driverless AI with information about which custom recipes can be used by the experiment. This is important for keeping experiments comparable when performing retrain / refit operations." }, { "output": " .. _h2o_drive:\n\n###############\nH2O Drive setup\n###############\n\nH2O Drive is an object-store for `H2O AI Cloud `_. This page describes how to configure Driverless AI to work with H2O Drive. Note: For more information on the H2O Drive, refer to the `official documentation `_. Description of relevant configuration attributes\n\n\nThe following are descriptions of the relevant configuration attributes when enabling the H2O Drive data connector:\n\n- ``enabled_file_systems``: A list of file systems you want to enable." }, { "output": " - ``h2o_drive_endpoint_url``: The H2O Drive server endpoint URL. - ``h2o_drive_access_token_scopes``: A space-separated list of OpenID scopes for the access token that are used by the H2O Drive connector. - ``h2o_drive_session_duration``: The maximum duration in seconds for a session with the H2O Drive. - ``authentication_method``: The authentication method used by DAI. When enabling the H2O Drive data connector, this must be set to OpenID Connect (``authentication_method=\"oidc\"``). For information on setting up OIDC Authentication in Driverless AI, see :ref:`oidc_auth`." }, { "output": " .. _install-on-macosx:\n\nMac OS X\n\n\nThis section describes how to install, start, stop, and upgrade the Driverless AI Docker image on Mac OS X. Note that this uses regular Docker and not NVIDIA Docker. Note: Support for GPUs and MOJOs is not available on Mac OS X. The installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://h2o.ai/o/try-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \license folder that you will create during the installation process." }, { "output": " Stick to small datasets! For serious use, please use Linux. - Be aware that there are known performance issues with Docker for Mac. 
More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#technology. Environment\n~\n\n+-+-+-+-+\n| Operating System | GPU Support? | Min Mem | Suitable for |\n+=+=+=+=+\n| Mac OS X | No | 16 GB | Experimentation |\n+-+-+-+-+\n\nInstalling Driverless AI\n\n\n1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/." }, { "output": " Download and run Docker for Mac from https://docs.docker.com/docker-for-mac/install. 3. Adjust the amount of memory given to Docker to be at least 10 GB. Driverless AI won't run at all with less than 10 GB of memory. You can optionally adjust the number of CPUs given to Docker. You will find the controls by clicking on (Docker Whale)->Preferences->Advanced as shown in the following screenshots. (Don't forget to Apply the changes after setting the desired memory value.) .. image:: ../images/macosx_docker_menu_bar.png\n :align: center\n\n.. image:: ../images/macosx_docker_advanced_preferences.png\n :align: center\n :height: 507\n :width: 382\n\n4." }, { "output": " More information is available here: https://docs.docker.com/docker-for-mac/osxfs/#namespaces. .. image:: ../images/macosx_docker_filesharing.png\n :align: center\n :scale: 40%\n\n5. Set up a directory for the version of Driverless AI within the Terminal: \n\n .. code-block:: bash\n :substitutions:\n\n mkdir |VERSION-dir|\n\n6. With Docker running, open a Terminal and move the downloaded Driverless AI image to your new directory. 7. Change directories to the new directory, then load the image using the following command:\n\n .. code-block:: bash\n :substitutions:\n\n cd |VERSION-dir|\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n8." }, { "output": " Optionally copy data into the data directory on the host. The data will be visible inside the Docker container at /data. You can also upload data after starting Driverless AI. 10. Run ``docker images`` to find the image tag. 11. Start the Driverless AI Docker image (still within the new Driverless AI directory). Replace TAG below with the image tag. Note that GPU support will not be available. Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker; if both are enabled in the launch command, tini prints a (harmless) warning message." }, { "output": " But if the user plans to build :ref:`image auto model ` extensively, then ``--shm-size=2g`` is recommended for the Driverless AI docker command. .. code-block:: bash\n :substitutions:\n\n docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n12. Connect to Driverless AI with your browser at http://localhost:12345." }, { "output": " These steps ensure that existing experiments are saved. WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded when Driverless AI is upgraded. - Build MLI models before upgrading. - Build MOJO pipelines before upgrading. - Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading. If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view MLI on that model after upgrading." }, { "output": " If that MLI job appears in the list of Interpreted Models in your current version, then it will be retained after upgrading. 
If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO pipelines on all desired models and then back up your Driverless AI tmp directory. Note: Stop Driverless AI if it is still running. Upgrade Steps\n'\n\n1. SSH into the IP address of the machine that is running Driverless AI." }, { "output": " Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n # cd into the new directory\n cd |VERSION-dir|\n\n3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory. 4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " .. _features-settings:\n\nFeatures Settings\n=\n\n``feature_engineering_effort``\n\n\n.. dropdown:: Feature Engineering Effort\n\t:open:\n\n\tSpecify a value from 0 to 10 for the Driverless AI feature engineering effort. Higher values generally lead to more time (and memory) spent in feature engineering. This value defaults to 5.\n\n\t- 0: Keep only numeric features. Only model tuning during evolution.\n\t- 1: Keep only numeric features and frequency-encoded categoricals. Only model tuning during evolution.\n\t- 2: Similar to 1, but with no Text features." }, { "output": " - 3: Similar to 5, but only tuning during evolution. Mixed tuning of features and model parameters.\n\t- 4: Similar to 5, but slightly more focused on model tuning.\n\t- 5: Balanced feature-model tuning. (Default)\n\t- 6-7: Similar to 5, but slightly more focused on feature engineering.\n\t- 8: Similar to 6-7, but even more focused on feature engineering, with a high feature generation rate and no feature dropping even at high interpretability.\n\t- 9-10: Similar to 8, but no model tuning during feature evolution. .. _check_distribution_shift:\n\n``check_distribution_shift``\n\n\n.. dropdown:: Data Distribution Shift Detection\n\t:open:\n\n\tSpecify whether Driverless AI should detect data distribution shifts between train/valid/test datasets (if provided)." }, { "output": " Currently, this information is only presented to the user and not acted upon. Shifted features should either be dropped, or more meaningful aggregate features should be created by using them as labels or bins. Also see :ref:`drop_features_distribution_shift_threshold_auc ` and :ref:`check_distribution_shift_drop `. .. _check_distribution_shift_drop:\n\n``check_distribution_shift_drop``\n~\n\n.. dropdown:: Data Distribution Shift Detection Drop of Features\n\t:open:\n\n\tSpecify whether to drop high-shift features." }, { "output": " Note that Auto for time series experiments turns this feature off. Also see :ref:`drop_features_distribution_shift_threshold_auc ` and :ref:`check_distribution_shift `. .. _drop_features_distribution_shift_threshold_auc:\n\n``drop_features_distribution_shift_threshold_auc``\n\n\n.. dropdown:: Max Allowed Feature Shift (AUC) Before Dropping Feature\n\t:open:\n\n\tSpecify the maximum allowed AUC value for a feature before dropping the feature." }, { "output": " This model includes an AUC value. 
If this AUC, GINI, or Spearman correlation of the model is above the specified threshold, then Driverless AI will consider it a strong enough shift to drop those features. The default AUC threshold is 0.999. .. _check_leakage:\n\n``check_leakage``\n~\n\n.. dropdown:: Data Leakage Detection\n\t:open:\n\n\tSpecify whether to check for data leakage for each feature. Some of the features may carry overly predictive power on the target column. This may affect model generalization." }, { "output": " Then, a simple model is built on each feature with significant variable importance. The models with high AUC (for classification) or R2 score (regression) are reported to the user as potential leaks. Note that this option is always disabled if the experiment is a time series experiment. This is set to Auto by default. The equivalent config.toml parameter is ``check_leakage``. Also see :ref:`drop_features_leakage_threshold_auc `\n\n.. _drop_features_leakage_threshold_auc:\n\n``drop_features_leakage_threshold_auc``\n~\n\n.. dropdown:: Data Leakage Detection Dropping AUC/R2 Threshold\n\t:open:\n\n\tIf :ref:`Leakage Detection ` is enabled, specify the threshold for dropping features." }, { "output": " This value defaults to 0.999. The equivalent config.toml parameter is ``drop_features_leakage_threshold_auc``. ``leakage_max_data_size``\n~\n\n.. dropdown:: Max Rows X Columns for Leakage\n\t:open:\n\n\tSpecify the maximum number of (rows x columns) to trigger sampling for leakage checks. This value defaults to 10,000,000. ``max_features_importance``\n~\n\n.. dropdown:: Max. num. features for variable importance\n\t:open:\n\n\tSpecify the maximum number of features to use and show in importance tables. For any interpretability higher than 1, transformed or original features with lower importance than the top max_features_importance features are always removed. Feature importances of transformed or original features will be pruned correspondingly." }, { "output": " .. _enable_wide_rules:\n\n``enable_wide_rules``\n~\n\n.. dropdown:: Enable Wide Rules\n\t:open:\n\n\tEnable various rules to handle wide datasets (i.e., no. of columns > no. of rows). The default value is \"auto\", which automatically enables wide rules when the number of columns is detected to be greater than the number of rows. Setting \"on\" forces rules to be enabled regardless of any conditions. Enabling wide data rules sets all ``max_cols``, ``max_orig_*col``, and ``fs_orig*`` tomls to large values, and enforces monotonicity to be disabled unless ``monotonicity_constraints_dict`` is set or the default value of ``monotonicity_constraints_interpretability_switch`` is changed." }, { "output": " It also enables the :ref:`Xgboost Random Forest model ` for modeling. To disable wide rules, set enable_wide_rules to \"off\". For mostly or entirely numeric datasets, selecting only 'OriginalTransformer' for faster speed is recommended (see :ref:`included_transformers `). Also see :ref:`wide_datasets_dai` for a quick model run. ``orig_features_fs_report``\n~\n\n.. dropdown:: Report Permutation Importance on Original Features\n\t:open:\n\n\tSpecify whether Driverless AI reports permutation importance on original features (represented as normalized change in the chosen metric) in logs and the report file." }, { "output": " ``max_rows_fs``\n~\n\n.. dropdown:: Maximum Number of Rows to Perform Permutation-Based Feature Selection\n\t:open:\n\n\tSpecify the maximum number of rows when performing permutation feature importance, reduced by (stratified) random sampling. 
This value defaults to 500,000. ``max_orig_cols_selected``\n\n\n.. dropdown:: Max Number of Original Features Used\n\t:open:\n\n\tSpecify the maximum number of columns to be selected from an existing set of columns using feature selection. This value defaults to 10,000,000." }, { "output": " This is useful to reduce the final model complexity. First, the best [max_orig_cols_selected] features are found through feature selection methods, and then these features are used in feature evolution (to derive other features) and in modelling. ``max_orig_nonnumeric_cols_selected``\n~\n\n.. dropdown:: Max Number of Original Non-Numeric Features\n\t:open:\n\n\tMaximum number of non-numeric columns selected, above which Driverless AI will do feature selection on all features; the same as above (max_orig_numeric_cols_selected), but for categorical columns." }, { "output": " This value defaults to 300. ``fs_orig_cols_selected``\n~\n\n.. dropdown:: Max Number of Original Features Used for FS Individual\n\t:open:\n\n\tSpecify the maximum number of features you want to be selected in an experiment. This value defaults to 10,000,000. Additional columns above the specified value add a special individual with the original columns reduced. ``fs_orig_numeric_cols_selected``\n~\n\n.. dropdown:: Number of Original Numeric Features to Trigger Feature Selection Model Type\n\t:open:\n\n\tThe maximum number of original numeric columns, above which Driverless AI will do feature selection." }, { "output": " A separate individual in the :ref:`genetic algorithm ` is created by doing feature selection by permutation importance on original features. This value defaults to 10,000,000. ``fs_orig_nonnumeric_cols_selected``\n\n\n.. dropdown:: Number of Original Non-Numeric Features to Trigger Feature Selection Model Type\n\t:open:\n\n\tThe maximum number of original non-numeric columns, above which Driverless AI will do feature selection on all features. Note that this is applicable only to special individuals with original columns reduced." }, { "output": " This value defaults to 200. ``max_relative_cardinality``\n\n\n.. dropdown:: Max Allowed Fraction of Uniques for Integer and Categorical Columns\n\t:open:\n\n\tSpecify the maximum fraction of unique values for integer and categorical columns. If the column has a larger fraction of unique values than that, it will be considered an ID column and ignored. This value defaults to 0.95. .. _num_as_cat:\n\n``num_as_cat``\n\n\n.. dropdown:: Allow Treating Numerical as Categorical\n\t:open:\n\n\tSpecify whether to allow some numerical features to be treated as categorical features." }, { "output": " The equivalent config.toml parameter is ``num_as_cat``. ``max_int_as_cat_uniques``\n\n\n.. dropdown:: Max Number of Unique Values for Int/Float to be Categoricals\n\t:open:\n\n\tSpecify the number of unique values for integer or real columns to be treated as categoricals. This value defaults to 50. ``max_fraction_invalid_numeric``\n\n\n.. dropdown:: Max. fraction of numeric values to be non-numeric (and not missing) for a column to still be considered numeric\n\t:open:\n\n\tWhen the fraction of non-numeric (and non-missing) values is less than or equal to this value, consider the column numeric." }, { "output": " Note: Replaces non-numeric values with missing values at the start of the experiment, so some information is lost, but the column is now treated as numeric, which can help. Disabled if < 0. 
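\tThe cardinality-related settings above are plain ``config.toml`` keys and can be pinned directly; a minimal sketch (the values shown are illustrative):\n\n\t::\n\n\t # never treat numeric columns as categoricals\n\t num_as_cat = false\n\t # int/real columns with <= 100 uniques may be treated as categorical (when allowed)\n\t max_int_as_cat_uniques = 100\n\t # columns whose fraction of unique values exceeds 0.95 are treated as ID columns and ignored\n\t max_relative_cardinality = 0.95\n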
.. _nfeatures_max:\n\n``nfeatures_max``\n~\n\n.. dropdown:: Max Number of Engineered Features\n\t:open:\n\n\tSpecify the maximum number of features to be included per model (and in each model within the final model if an ensemble). After each scoring, based on this parameter value, the top features by variable importance are kept and the rest are pruned away." }, { "output": " The final scoring pipeline will exclude any pruned-away features, but may contain a few new features due to fitting on a different data view (e.g. new clusters). The default value of -1 means no restrictions are applied for this parameter except internally-determined memory and interpretability restrictions. Notes:\n\n\t * If ``interpretability`` > ``remove_scored_0gain_genes_in_postprocessing_above_interpretability`` (see :ref:`config.toml ` for reference), then every GA (:ref:`genetic algorithm `) iteration post-processes features down to this value just after scoring them." }, { "output": " * If ``ngenes_max`` is also not limited, then some individuals will have more genes and features until pruned by mutation or by preparation for the final model. * E.g. to generally limit every iteration to exactly 1 feature, one must set ``nfeatures_max`` = ``ngenes_max`` = 1 and ``remove_scored_0gain_genes_in_postprocessing_above_interpretability`` = 0, but the genetic algorithm will have a harder time finding good features. The equivalent config.toml parameter is ``nfeatures_max`` (also see ``nfeatures_max_threshold`` in :ref:`config.toml `)." }, { "output": " This controls the number of genes before features are scored, so Driverless AI will just randomly sample genes if pruning occurs. If restriction occurs after scoring features, then aggregated gene importances are used for pruning genes. Instances include all possible transformers, including the original transformer for numeric features. A value of -1 means no restrictions except internally-determined memory and interpretability restrictions. The equivalent config.toml parameter is ``ngenes_max``. ``features_allowed_by_interpretability``\n\n\n.. dropdown:: Limit Features by Interpretability\n\t:open:\n\n\tSpecify whether to limit feature counts with the Interpretability training setting as specified by the ``features_allowed_by_interpretability`` :ref:`config.toml ` setting." }, { "output": " This value defaults to 7. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference. .. _monotonicity-constraints-correlation-threshold:\n\n``monotonicity_constraints_correlation_threshold``\n\n\n.. dropdown:: Correlation Beyond Which to Trigger Monotonicity Constraints (if enabled)\n\t:open:\n\n\tSpecify the threshold of the Pearson product-moment correlation coefficient between a numerical or encoded transformed feature and the target, above which (below the negative of which) to use positive (negative) monotonicity for XGBoostGBM, LightGBM and Decision Tree models." }, { "output": " Note: This setting is only enabled when Interpretability is greater than or equal to the value specified by the :ref:`enable-constraints` setting and when the :ref:`constraints-override` setting is not specified. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference. ``monotonicity_constraints_log_level``\n\n\n.. 
dropdown:: Control amount of logging when calculating automatic monotonicity constraints (if enabled)\n\t:open:\n\n\tFor models that support monotonicity constraints, and if enabled, show the automatically determined monotonicity constraints for each feature going into the model based on its correlation with the target." }, { "output": " 'medium' shows the correlation of positively and negatively constrained features. 'high' shows all correlation values. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference. .. _monotonicity-constraints-drop-low-correlation-features:\n\n``monotonicity_constraints_drop_low_correlation_features``\n\n\n.. dropdown:: Whether to drop features that have no monotonicity constraint applied (e.g., due to low correlation with target)\n\t:open:\n\n\tIf enabled, only monotonic features with +1/-1 constraints will be passed to the model(s), and features without monotonicity constraints (0) will be dropped." }, { "output": " Only active when interpretability >= monotonicity_constraints_interpretability_switch or monotonicity_constraints_dict is provided. Also see :ref:`monotonic gbm recipe ` and :ref:`Monotonicity Constraints in Driverless AI ` for reference. .. _constraints-override:\n\n``monotonicity_constraints_dict``\n\n\n.. dropdown:: Manual Override for Monotonicity Constraints\n\t:open:\n\n\tSpecify a list of features for which monotonicity constraints are applied." }, { "output": " The following is an example of how this list can be specified:\n\n\t::\n\n\t \"{'PAY_0': -1, 'PAY_2': -1, 'AGE': -1, 'BILL_AMT1': 1, 'PAY_AMT1': -1}\"\n\n\tNote: If a list is not provided, then the automatic correlation-based method is used when monotonicity constraints are enabled at high enough interpretability settings. See :ref:`Monotonicity Constraints in Driverless AI ` for reference. .. _max-feature-interaction-depth:\n\n``max_feature_interaction_depth``\n~\n\n.. dropdown:: Max Feature Interaction Depth\n\t:open:\n\n\tSpecify the maximum number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates." }, { "output": " The interaction can take multiple forms (i.e. feature1 + feature2 or feature1 * feature2 + \u2026 featureN). Although certain machine learning algorithms (like tree-based methods) can do well at capturing these interactions as part of their training process, generating them may still help them (or other algorithms) yield better performance. The depth of the interaction level (as in \"up to\" how many features may be combined at once to create one single feature) can be specified to control the complexity of the feature engineering process." }, { "output": " This value defaults to 8. Set Max Feature Interaction Depth to 1 (``max_feature_interaction_depth=1``) to disable any feature interactions. ``fixed_feature_interaction_depth``\n~\n\n.. dropdown:: Fixed Feature Interaction Depth\n\t:open:\n\n\tSpecify a fixed non-zero number of features to use for interaction features like grouping for target encoding, weight of evidence, and other likelihood estimates. To use all features for each transformer, set this to be equal to the number of columns. To do a 50/50 sample and a fixed feature interaction depth of :math:`n` features, set this to -:math:`n`."
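}, { "output": " As a short sketch, the two interaction-depth settings above can be set in ``config.toml`` as follows (the values shown are illustrative, not defaults):\n\n\t::\n\n\t # allow interaction features to combine up to 4 features at once\n\t max_feature_interaction_depth = 4\n\n\t # or pin every interaction transformer to exactly 3 features\n\t # fixed_feature_interaction_depth = 3 "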
}, { "output": " Target encoding refers to several different feature transformations (primarily focused on categorical data) that aim to represent the feature using information of the actual target variable. A simple example can be to use the mean of the target to replace each unique category of a categorical feature. These type of features can be very predictive but are prone to overfitting and require more memory as they need to store mappings of the unique categories and the target values. ``cvte_cv_in_cv``\n~\n\n.. dropdown:: Enable Outer CV for Target Encoding\n\t:open:\n\n\tFor target encoding, specify whether an outer level of cross-fold validation is performed in cases where GINI is detected to flip sign or have an inconsistent sign for weight of evidence between ``fit_transform`` (on training data) and ``transform`` (on training and validation data)." }, { "output": " This is enabled by default. ``enable_lexilabel_encoding``\n~\n\n.. dropdown:: Enable Lexicographical Label Encoding\n\t:open:\n\n\tSpecify whether to enable lexicographical label encoding. This is disabled by default. ``enable_isolation_forest``\n~\n\n.. dropdown:: Enable Isolation Forest Anomaly Score Encoding\n\t:open:\n\n\t`Isolation Forest `__ is useful for identifying anomalies or outliers in data. Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that selected feature." }, { "output": " Random partitioning produces noticeably shorter paths for anomalies. When a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies. This option lets you specify whether to return the anomaly score of each sample. This is disabled by default. ``enable_one_hot_encoding``\n~\n\n.. dropdown:: Enable One HotEncoding\n\t:open:\n\n\tSpecify whether one-hot encoding is enabled. The default Auto setting is only applicable for small datasets and GLMs." }, { "output": " This value defaults to 200. ``drop_constant_columns``\n~\n\n.. dropdown:: Drop Constant Columns\n\t:open:\n\n\tSpecify whether to drop columns with constant values. This is enabled by default. ``drop_id_columns``\n~\n\n.. dropdown:: Drop ID Columns\n\t:open:\n\n\tSpecify whether to drop columns that appear to be an ID. This is enabled by default. ``no_drop_features``\n\n\n.. dropdown:: Don't Drop Any Columns\n\t:open:\n\n\tSpecify whether to avoid dropping any columns (original or derived). This is disabled by default." }, { "output": " This setting allows you to select many features at once by copying and pasting a list of column names (in quotes) separated by commas. .. _cols_to_force_in:\n\n``cols_to_force_in``\n~\n\n.. dropdown:: Features to always keep or force in, e.g. \"G1\", \"G2\", \"G3\"\n\t:open:\n\n\tControl over columns to force-in. Forced-in features are handled by the most interpretable transformers allowed by the experiment options, and they are never removed (even if the model assigns 0 importance to them). Transformers used by default includes:\n\n\t\t- OriginalTransformer for numeric,\n\t\t- CatOriginalTransformer or FrequencyTransformer for categorical,\n\t\t- TextOriginalTransformer for text,\n\t\t- DateTimeOriginalTransformer for date-times,\n\t\t- DateOriginalTransformer for dates,\n\t\t- ImageOriginalTransformer or ImageVectorizerTransformer for images, etc\n\n\n\n``cols_to_group_by``\n\n\n.. 
dropdown:: Features to Group By\n\t:open:\n\n\tSpecify which features to group columns by." }, { "output": " ``sample_cols_to_group_by``\n~\n\n.. dropdown:: Sample from Features to Group By\n\t:open:\n\n\tSpecify whether to sample from given features to group by or to always group all features. This is disabled by default. ``agg_funcs_for_group_by``\n\n\n.. dropdown:: Aggregation Functions (Non-Time-Series) for Group By Operations\n\t:open:\n\n\tSpecify whether to enable aggregation functions to use for group by operations. Choose from the following (all are selected by default):\n\n\t- mean\n\t- sd\n\t- min\n\t- max\n\t- count\n\n``folds_for_group_by``\n\n\n.. dropdown:: Number of Folds to Obtain Aggregation When Grouping\n\t:open:\n\n\tSpecify the number of folds to obtain aggregation when grouping." }, { "output": " The default value is 5. .. _mutation_mode:\n\n``mutation_mode``\n~\n\n.. dropdown:: Type of Mutation Strategy\n\t:open:\n\n\tSpecify which strategy to apply when performing mutations on transformers. Select from the following:\n\n\t- sample: Sample transformer parameters (Default)\n\t- batched: Perform multiple types of the same transformation together\n\t- full: Perform more types of the same transformation together than the above strategy\n\n``dump_varimp_every_scored_indiv``\n\n\n.. dropdown:: Enable Detailed Scored Features Info\n\t:open:\n\n\tSpecify whether to dump every scored individual's variable importance (both derived and original) to a csv/tabulated/json file." }, { "output": " This is disabled by default. ``dump_trans_timings``\n\n\n.. dropdown:: Enable Detailed Logs for Timing and Types of Features Produced\n\t:open:\n\n\tSpecify whether to dump every scored fold's timing and feature info to a timings.txt file. This is disabled by default. ``compute_correlation``\n~\n\n.. dropdown:: Compute Correlation Matrix\n\t:open:\n\n\tSpecify whether to compute training, validation, and test correlation matrixes. When enabled, this setting creates table and heatmap PDF files that are saved to disk." }, { "output": " This is disabled by default. ``interaction_finder_gini_rel_improvement_threshold``\n~\n\n.. dropdown:: Required GINI Relative Improvement for Interactions\n\t:open:\n\n\tSpecify the required GINI relative improvement value for the InteractionTransformer. If the GINI coefficient is not better than the specified relative improvement value in comparison to the original features considered in the interaction, then the interaction is not returned. If the data is noisy and there is no clear signal in interactions, this value can be decreased to return interactions." }, { "output": " ``interaction_finder_return_limit``\n~\n\n.. dropdown:: Number of Transformed Interactions to Make\n\t:open:\n\n\tSpecify the number of transformed interactions to make from generated trial interactions. (The best transformed interactions are selected from the group of generated trial interactions.) This value defaults to 5. .. _enable_rapids_transformers:\n\n``enable_rapids_transformers``\n\n\n.. dropdown:: Whether to enable RAPIDS cuML GPU transformers (no mojo)\n\t:open:\n\n\tSpecify whether to enable GPU-based `RAPIDS cuML `__ transformers." }, { "output": " The equivalent config.toml parameter is ``enable_rapids_transformers`` and the default value is False. .. _lowest_allowed_variable_importance:\n\n``varimp_threshold_at_interpretability_10``\n~\n\n.. 
dropdown:: Lowest allowed variable importance at interpretability 10\n\t:open:\n\n\tSpecify the variable importance below which features are dropped (with the possibility of a replacement being found that's better). This setting also sets the overall scale for lower interpretability settings. Set this to a lower value if you're content with having many weak features despite choosing high interpretability, or if you see a drop in performance due to the need for weak features." }, { "output": " Delta improvement of score corresponds to original metric minus metric of shuffled feature frame if maximizing metric, and corresponds to negative of such a score difference if minimizing. Feature selection by permutation importance considers the change in score after shuffling a feature, and using minimum operation ignores optimistic scores in favor of pessimistic scores when aggregating over folds. Note, if using tree methods, multiple depths may be fitted, in which case regardless of this toml setting, only features that are kept for all depths are kept by feature selection." }, { "output": " Hive Setup\n\n\nDriverless AI lets you explore Hive data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Hive. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``enabled_file_systems``: The file systems you want to enable." }, { "output": " - ``hive_app_configs``: Configuration for Hive Connector. Inputs are similar to configuring the HDFS connector. Important keys include:\n \n - ``hive_conf_path``: The path to Hive configuration. This can have multiple files (e.g. hive-site.xml, hdfs-site.xml, etc.) - ``auth_type``: Specify one of ``noauth``, ``keytab``, or ``keytabimpersonation`` for Kerberos authentication\n - ``keytab_path``: Specify the path to Kerberos keytab to use for authentication (this can be ``\"\"`` if using ``auth_type=\"noauth\"``)\n - ``principal_user``: Specify the Kerberos app principal user (required when using ``auth_type=\"keytab\"`` or ``auth_type=\"keytabimpersonation\"``)\n\nNotes:\n\n- With Hive connectors, it is assumed that DAI is running on the edge node." }, { "output": " missing classes, dependencies, authorization errors). - Ensure the core-site.xml file (from e.g Hadoop conf) is also present in the Hive conf with the rest of the files (hive-site.xml, hdfs-site.xml, etc.). The core-site.xml file should have proxyuser configured (e.g. ``hadoop.proxyuser.hive.hosts`` & ``hadoop.proxyuser.hive.groups``). - If you have tez as the Hive execution engine, make sure that the required tez dependencies (classpaths, jars, etc.) are available on the DAI node. Alternatively, you can use internal engines that come with DAI by changing your ``hive.execution.engine`` value in the hive-site.xml file to ``mr`` or ``spark``." 
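}, { "output": " Tip: The value of ``hive_app_configs`` must parse as a JSON string, so a quoting mistake (such as a missing quote around a key name) can silently break the connector. A quick way to catch such mistakes before restarting Driverless AI is to parse the candidate string with Python's standard ``json`` module. The sketch below is illustrative only: the connection name, paths, and principal are placeholders, and Driverless AI's own parsing may tolerate minor deviations from strict JSON (such as the trailing commas in the example that follows).\n\n .. code-block:: python\n\n      import json\n\n      # Placeholder configuration; substitute your own paths and principal.\n      candidate = \"\"\"{\n          \"hive_connection_1\": {\n              \"hive_conf_path\": \"/path/to/hive/conf\",\n              \"auth_type\": \"keytabimpersonation\",\n              \"keytab_path\": \"/path/to/.keytab\",\n              \"principal_user\": \"hive/node1.example.com@EXAMPLE.COM\"\n          }\n      }\"\"\"\n\n      # json.loads raises json.JSONDecodeError (with line and column) on malformed input.\n      connections = json.loads(candidate)\n      for name, conf in connections.items():\n          # Each connection should at least name a Hive conf path and an auth type.\n          missing = {\"hive_conf_path\", \"auth_type\"} - conf.keys()\n          print(name, \"ok\" if not missing else \"missing keys: %s\" % missing)"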
}, { "output": " For example:\n \n ::\n\n \"\"\"{\n \"hive_connection_1\": {\n \"hive_conf_path\": \"/path/to/hive/conf\",\n \"auth_type\": \"one of ['noauth', 'keytab',\n 'keytabimpersonation']\",\n \"keytab_path\": \"/path/to/.keytab\",\n \"principal_user\": \"hive/node1.example.com@EXAMPLE.COM\",\n },\n \"hive_connection_2\": {\n \"hive_conf_path\": \"/path/to/hive/conf_2\",\n \"auth_type\": \"one of ['noauth', 'keytab', \n 'keytabimpersonation']\",\n \"keytab_path\": \"/path/to/.keytab\",\n \"principal_user\": \"hive/node2.example.com@EXAMPLE.COM\",\n }\n }\"\"\"\n\n Note: The expected input of ``hive_app_configs`` is a `JSON string `__." }, { "output": " Depending on how the configuration value is applied, different forms of outer quotations may be required. The following examples show two unique methods for applying outer quotations. - Configuration value applied with the config.toml file:\n\n ::\n\n hive_app_configs = \"\"\"{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}\"\"\"\n\n - Configuration value applied with an environment variable:\n\n ::\n\n DRIVERLESS_AI_HIVE_APP_CONFIGS='{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}'\n\n- ``hive_app_jvm_args``: Optionally specify additional Java Virtual Machine (JVM) args for the Hive connector." }, { "output": " Notes:\n\n - If a custom `JAAS configuration file `__ is needed for your Kerberos setup, use ``hive_app_jvm_args`` to specify the appropriate file:\n\n ::\n\n hive_app_jvm_args = \"-Xmx20g -Djava.security.auth.login.config=/etc/dai/jaas.conf\"\n\n Sample ``jaas.conf`` file:\n ::\n\n com.sun.security.jgss.initiate {\n com.sun.security.auth.module.Krb5LoginModule required\n useKeyTab=true\n useTicketCache=false\n principal=\"hive/localhost@EXAMPLE.COM\" [Replace this line]\n doNotPrompt=true\n keyTab=\"/path/to/hive.keytab\" [Replace this line]\n debug=true;\n };\n\n- ``hive_app_classpath``: Optionally specify an alternative classpath for the Hive connector." }, { "output": " This can be done by specifying each environment variable in the ``nvidia-docker run`` command or by editing the configuration options in the config.toml file and then specifying that file in the ``nvidia-docker run`` command. .. tabs:: \n .. group-tab:: Docker Image Installs\n\n 1. Start the Driverless AI Docker Image. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs,hive\" \\\n -e DRIVERLESS_AI_HIVE_APP_CONFIGS='{\"hive_connection_2\": {\"hive_conf_path\":\"/etc/hadoop/conf\",\n \"auth_type\":\"keytabimpersonation\",\n \"keytab_path\":\"/etc/dai/steam.keytab\",\n \"principal_user\":\"steam/mr-0xg9.0xdata.loc@H2OAI.LOC\"}}' \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -v /path/to/hive/conf:/path/to/hive/conf/in/docker \\\n -v /path/to/hive.keytab:/path/in/docker/hive.keytab \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure Hive options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Enable and configure the Hive connector in the Driverless AI config.toml file. The Hive connector configuration must be a JSON/Dictionary string with multiple keys. .. 
code-block:: bash \n\n enabled_file_systems = \"file, hdfs, s3, hive\"\n hive_app_configs = \"\"\"{\"hive_1\": {\"auth_type\": \"keytab\",\n \"keytab_path\": \"/path/to/hive.keytab\",\n \"hive_conf_path\": \"/path/to/hive-resources\",\n \"principal_user\": \"hive/localhost@EXAMPLE.COM\"}}\"\"\"\n\n 2." }, { "output": " .. code-block:: bash \n :substitutions:\n\n nvidia-docker run \\\n --pid=host \\\n --init \\\n --rm \\\n --shm-size=256m \\\n --add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -v /path/to/hive/conf:/path/to/hive/conf/in/docker \\\n -v /path/to/hive.keytab:/path/in/docker/hive.keytab \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. group-tab:: Native Installs\n\n This example enables the Hive connector." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\"\n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs, s3, hive\"\n\n \n # Configuration for Hive Connector\n # Note that inputs are similar to configuring HDFS connectivity\n # Important keys:\n # * hive_conf_path - path to hive configuration, may have multiple files. Typically: hive-site.xml, hdfs-site.xml, etc\n # * auth_type - one of `noauth`, `keytab`, `keytabimpersonation` for kerberos authentication\n # * keytab_path - path to the kerberos keytab to use for authentication, can be \"\" if using `noauth` auth_type\n # * principal_user = Kerberos app principal user."
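}, { "output": " #\n # For reference only, a minimal unsecured (noauth) connection could look\n # like the following sketch (the connection name and conf path are placeholders):\n # hive_app_configs = \"\"\"{\"hive_noauth\": {\"auth_type\": \"noauth\",\n #                                         \"keytab_path\": \"\",\n #                                         \"hive_conf_path\": \"/path/to/hive/conf\",\n #                                         \"principal_user\": \"\"}}\"\"\""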
}, { "output": " # Example:\n # \"\"\"{\n # \"hive_connection_1\": {\n # \"hive_conf_path\": \"/path/to/hive/conf\",\n # \"auth_type\": \"one of ['noauth', 'keytab', 'keytabimpersonation']\",\n # \"keytab_path\": \"/path/to/.keytab\",\n # \"principal_user\": \"hive/localhost@EXAMPLE.COM\",\n # }\n # }\"\"\"\n #\n hive_app_configs = \"\"\"{\"hive_1\": {\"auth_type\": \"keytab\",\n \"keytab_path\": \"/path/to/hive.keytab\",\n \"hive_conf_path\": \"/path/to/hive-resources\",\n \"principal_user\": \"hive/localhost@EXAMPLE.COM\"}}\"\"\"\n\n 3." }, { "output": " Save the changes when you are done, then stop/restart Driverless AI. Adding Datasets Using Hive\n~\n\nAfter the Hive connector is enabled, you can add datasets by selecting Hive from the Add Dataset (or Drag and Drop) drop-down menu. 1. Select the Hive configuration that you want to use. .. figure:: ../images/hive_select_configuration.png\n :alt: Select Hive configuration\n\n2. Specify the following information to add your dataset. - Hive Database: Specify the name of the Hive database that you are querying. - Hadoop Configuration Path: Specify the path to your Hive configuration file." }, { "output": " Install on Ubuntu\n-\n\nThis section describes how to install the Driverless AI Docker image on Ubuntu. The installation steps vary depending on whether your system has GPUs or is CPU only. Environment\n~\n\n+------------------+-------+---------+\n| Operating System | GPUs? | Min Mem |\n+==================+=======+=========+\n| Ubuntu with GPUs | Yes   | 64 GB   |\n+------------------+-------+---------+\n| Ubuntu with CPUs | No    | 64 GB   |\n+------------------+-------+---------+\n\n.. _install-on-ubuntu-with-gpus:\n\nInstall on Ubuntu with GPUs\n~\n\nNote: Driverless AI is supported on Ubuntu 16.04 or later." }, { "output": " Once you are logged in, perform the following steps. 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. (Note that the contents of this Docker image include a CentOS kernel and CentOS packages.) 2. Install and run Docker on Ubuntu (if not already installed):\n\n .. code-block:: bash\n\n # Install and run Docker on Ubuntu\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n sudo apt-key fingerprint 0EBFCD88\n sudo add-apt-repository \\ \n \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\" \n sudo apt-get update\n sudo apt-get install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md. .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(. /etc/os-release;echo $ID$VERSION_ID)\n curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \\\n sudo tee /etc/apt/sources.list.d/nvidia-docker.list\n sudo apt-get update\n\n # Install nvidia-docker2 and reload the Docker daemon configuration\n sudo apt-get install -y nvidia-docker2\n\n4." }, { "output": " If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver: \n\n .. code-block:: bash\n\n nvidia-smi\n\n5. Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n6. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. 
code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\n8. Set up the data, log, and license directories on the host machine:\n\n .. code-block:: bash\n\n # Set up the data, log, license, and tmp directories on the host machine (within the new directory)\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n9. At this point, you can copy data into the data directory on the host machine." }, { "output": " 10. Run ``docker images`` to find the image tag. 11. Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command. Note that from version 1.10, the DAI Docker image runs with an internal ``tini`` that is equivalent to using ``--init`` from Docker; if both are enabled in the launch command, tini prints a (harmless) warning message. We recommend ``--shm-size=256m`` in the Docker launch command." }, { "output": " Note: Use ``docker version`` to check which version of Docker you are using. .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run --runtime=nvidia \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n --pid=host \\\n --rm \\\n --shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n ---------------------------------\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n12." }, { "output": " This section describes how to install and start the Driverless AI Docker image on Ubuntu. Note that this uses ``docker`` and not ``nvidia-docker``. GPU support will not be available. Watch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same. Open a Terminal and ssh to the machine that will run Driverless AI." }, { "output": " 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Install and run Docker on Ubuntu (if not already installed):\n\n .. code-block:: bash\n\n # Install and run Docker on Ubuntu\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n sudo apt-key fingerprint 0EBFCD88\n sudo add-apt-repository \\ \n \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n sudo apt-get update\n sudo apt-get install docker-ce\n sudo systemctl start docker\n\n3."
}, { "output": " Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5. Set up the data, log, license, and tmp directories on the host machine (within the new directory):\n\n .. code-block:: bash\n \n # Set up the data, log, license, and tmp directories\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n6." }, { "output": " The data will be visible inside the Docker container. 7. Run ``docker images`` to find the new image tag. 8. Start the Driverless AI Docker image. Note that GPU support will not be available. Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini will print a (harmless) warning message. We recommend ``shm-size=256m`` in docker launch command. But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " .. _linux-tarsh:\n\nLinux TAR SH\n\n\nThe Driverless AI software is available for use in pure user-mode environments as a self-extracting TAR SH archive. This form of installation does not require a privileged user to install or to run. This artifact has the same compatibility matrix as the RPM and DEB packages (combined), it just comes packaged slightly differently. See those sections for a full list of supported environments. The installation steps assume that you have a valid license key for Driverless AI." }, { "output": " Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in. .. note::\n\tTo ensure that :ref:`AutoDoc ` pipeline visualizations are generated correctly on native installations, installing `fontconfig `_ is recommended. Requirements\n\n\n- RedHat 7/RedHat 8 or Ubuntu 16.04/Ubuntu 18.04/Ubuntu 20.04/Ubuntu 22.04\n- NVIDIA drivers >= |NVIDIA-driver-ver| recommended (GPU only). Note that if you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02\n- OpenCL (Required for full LightGBM support on GPU-powered systems)\n- Driverless AI TAR SH, available from https://www.h2o.ai/download/\n\nNote: CUDA 11.2.2 (for GPUs) and cuDNN (required for TensorFlow support on GPUs) are included in the Driverless AI package." }, { "output": " To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used. CUDA LightGBM is only supported on Pascal-powered (and later) systems, and can be enabled manually with the ``enable_lightgbm_cuda_support`` config.toml setting." }, { "output": " .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI. chmod 755 |VERSION-tar-lin|\n ./|VERSION-tar-lin|\n\nYou may now cd to the unpacked directory and optionally make changes to config.toml. Starting Driverless AI\n\n\n.. code-block:: bash\n \n # Start Driverless AI. ./run-dai.sh\n\nStarting NVIDIA Persistence Mode\n\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every reboot. 
For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Run the following for Centos7/RH7 based systems using yum and x86. .. code-block:: bash\n\n yum -y clean all\n yum -y makecache\n yum -y update\n wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/c/clinfo-2.1.17.02.09-1.el7.x86_64.rpm\n wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/o/ocl-icd-2.2.12-1.el7.x86_64.rpm\n rpm -if clinfo-2.1.17.02.09-1.el7.x86_64.rpm\n rpm -if ocl-icd-2.2.12-1.el7.x86_64.rpm\n clinfo\n\n mkdir -p /etc/OpenCL/vendors && \\\n echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd\n\nLooking at Driverless AI log files\n\n\n.. code-block:: bash\n\n less log/dai.log\n less log/h2o.log\n less log/procsy.log\n less log/vis-server.log\n\nStopping Driverless AI\n\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " By default, all files for Driverless AI are contained within this directory. Upgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere. Driverless AI ships with CUDA 11.2.2 for GPUs, but the driver must exist in the host environment. Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers." }, { "output": " Experiment Settings\n=\n\nThis section includes settings that can be used to customize the experiment like total runtime, reproducibility level, pipeline building, feature brain control, adding config.toml settings and more. ``max_runtime_minutes``\n~\n\n.. dropdown:: Max Runtime in Minutes Before Triggering the Finish Button\n\t:open:\n\n\tSpecify the maximum runtime in minutes for an experiment. This is equivalent to pushing the Finish button once half of the specified time value has elapsed. Note that the overall enforced runtime is only an approximation." }, { "output": " The Finish button will be automatically selected once 12 hours have elapsed, and Driverless AI will subsequently attempt to complete the overall experiment in the remaining 12 hours. Set this value to 0 to disable this setting. Note that this setting applies to per experiment so if building leaderboard models(n) it will apply to each experiment separately(i.e total allowed runtime will be n*24hrs. This time estimate assumes running each experiment one at a time, sequentially)\n\n.. _max_runtime_minutes_until_abort:\n\n``max_runtime_minutes_until_abort``\n~\n\n.. dropdown:: Max Runtime in Minutes Before Triggering the Abort Button\n\t:open:\n\n\tSpecify the maximum runtime in minutes for an experiment before triggering the abort button." }, { "output": " This value defaults to 10080 mins (7 days). Note that this setting applies to per experiment so if building leaderboard models( say n), it will apply to each experiment separately(i.e total allowed runtime will be n*7days. This time estimate assumes running each experiment one at a time, sequentially). Also see :ref:`time_abort `. .. _time_abort:\n\n``time_abort``\n\n\n.. dropdown:: Time to Trigger the 'Abort' Button\n\t:open:\n\n\tIf the experiment is not done by this time, push the abort button." }, { "output": " Also see :ref:`max_runtime_minutes_until_abort ` for control over per experiment abort times. 
This accepts time in the format given by time_abort_format (defaults to %Y-%m-%d %H:%M:%S). This assumes the timezone set by time_abort_timezone in config.toml (defaults to UTC). You can also specify integer seconds since 1970-01-01 00:00:00 UTC. This will apply to the time on a DAI worker that runs the experiments. Similar to :ref:`max_runtime_minutes_until_abort `, time abort will preserve experiment artifacts made so far for summary and log zip files." }, { "output": " .. _pipeline-building-recipe:\n\n``pipeline-building-recipe``\n\n\n.. dropdown:: Pipeline Building Recipe\n\t:open:\n\n\tSpecify the Pipeline Building recipe type (overrides GUI settings). Select from the following:\n\n\t- Auto: Specifies that all models and features are automatically determined by experiment settings, config.toml settings, and the feature engineering effort. (Default)\n\n\t- Compliant: Similar to Auto except for the following:\n\n\t\t- Interpretability is set to 10. - Only uses GLM or booster as 'gblinear'." }, { "output": " - :ref:`Feature brain level ` is set to 0. - Max feature interaction depth is set to 1, i.e., no interactions. - Target transformer is set to 'identity' for regression. - Does not use :ref:`distribution shift ` detection. - :ref:`monotonicity_constraints_correlation_threshold ` is set to 0. - monotonic_gbm: Similar to Auto except for the following:\n\n\t\t- Enables monotonicity constraints\n\t\t- Only uses LightGBM model." }, { "output": " See :ref:`monotonicity-constraints-drop-low-correlation-features ` and :ref:`monotonicity-constraints-correlation-threshold `. - Does not build an ensemble model, i.e., sets ``fixed_ensemble_level=0``\n\t\t- No :ref:`feature brain ` is used, to ensure every restart is identical. - :ref:`Interaction depth ` is set to 1, i.e., no multi-feature interactions are done, to avoid complexity." }, { "output": " The equivalent config.toml parameter is ``recipe=['monotonic_gbm']``. - :ref:`num_as_cat ` feature transformation is disabled. - List of included_transformers\n\t\t\n \t| 'OriginalTransformer', #numeric (no clustering, no interactions, no num->cat)\n \t| 'CatOriginalTransformer', 'RawTransformer','CVTargetEncodeTransformer', 'FrequentTransformer','WeightOfEvidenceTransformer','OneHotEncodingTransformer', #categorical (but no num-cat)\n \t| 'CatTransformer','StringConcatTransformer', # big data only\n \t| 'DateOriginalTransformer', 'DateTimeOriginalTransformer', 'DatesTransformer', 'DateTimeDiffTransformer', 'IsHolidayTransformer', 'LagsTransformer', 'EwmaLagsTransformer', 'LagsInteractionTransformer', 'LagsAggregatesTransformer',#dates/time\n \t| 'TextOriginalTransformer', 'TextTransformer', 'StrFeatureTransformer', 'TextCNNTransformer', 'TextBiGRUTransformer', 'TextCharCNNTransformer', 'BERTTransformer',#text\n \t| 'ImageOriginalTransformer', 'ImageVectorizerTransformer'] #image\n\n \tFor reference also see :ref:`Monotonicity Constraints in Driverless AI `." }, { "output": " - The test set is concatenated with the train set, with the target marked as missing\n\t\t- Transformers that do not use the target are allowed to ``fit_transform`` across the entirety of the train, validation, and test sets. - Opens up limits on several config.toml expert options. - nlp_model: Only enable NLP BERT models based on PyTorch to process pure text. To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended. For more information, see :ref:`nlp-in-dai`. 
- included_models = ['TextBERTModel', 'TextMultilingualBERTModel', 'TextXLNETModel', 'TextXLMModel','TextRoBERTaModel', 'TextDistilBERTModel', 'TextALBERTModel', 'TextCamemBERTModel', 'TextXLMRobertaModel']\n\t\t- enable_pytorch_nlp_transformer = 'off'\n\t\t- enable_pytorch_nlp_model = 'on'\n\n\t- nlp_transformer: Only enable PyTorch based BERT transformers that process pure text." }, { "output": " For more information, see :ref:`nlp-in-dai`. - included_transformers = ['BERTTransformer']\n\t\t- excluded_models = ['TextBERTModel', 'TextMultilingualBERTModel', 'TextXLNETModel', 'TextXLMModel','TextRoBERTaModel', 'TextDistilBERTModel', 'TextALBERTModel', 'TextCamemBERTModel', 'TextXLMRobertaModel']\n\t\t- enable_pytorch_nlp_transformer = 'on'\n\t\t- enable_pytorch_nlp_model = 'off'\n\n\t- image_model: Only enable image models that process pure images (ImageAutoModel). To avoid slowdown when using this recipe, enabling one or more GPUs is strongly recommended." }, { "output": " Notes:\n\n \t\t- This option disables the :ref:`Genetic Algorithm ` (GA). - Image insights are only available when this option is selected. - image_transformer: Only enable the ImageVectorizer transformer, which processes pure images. For more information, see :ref:`image-embeddings`. - unsupervised: Only enable unsupervised transformers, models and scorers. :ref:`See ` for reference. - gpus_max: Maximize use of GPUs (e.g. use XGBoost, RAPIDS, Optuna hyperparameter search, etc." }, { "output": " Each pipeline building recipe mode can be chosen and then fine-tuned using the expert settings. Changing the pipeline building recipe will reset all pipeline building recipe options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of pipeline building recipe rules. If you choose to run a new/continued/refitted/retrained experiment from a parent experiment, the recipe rules are not re-applied, and any fine-tuning is preserved." }, { "output": " To reset recipe behavior, switch between 'auto' and the desired mode; this way the new child experiment will use the default settings for the chosen recipe. .. _enable_genetic_algorithm:\n\n``enable_genetic_algorithm``\n\n\n.. dropdown:: Enable Genetic Algorithm for Selection and Tuning of Features and Models\n\t:open:\n\n\tSpecify whether to enable :ref:`genetic algorithm ` for selection and hyper-parameter tuning of features and models:\n\n\t- auto: Default value is 'auto'. This is the same as 'on' unless it is a pure NLP or Image experiment. - on: Driverless AI genetic algorithm is used for feature engineering and model tuning and selection." }, { "output": " In the Optuna case, the scores shown in the iteration panel are the best score and trial scores. Optuna mode currently only uses Optuna for XGBoost, LightGBM, and CatBoost (custom recipe). If the Pruner is enabled (as it is by default), Optuna mode disables mutations of the evaluation metric (eval_metric) so that pruning uses the same metric across trials to compare. - off: When set to 'off', the final pipeline is trained using the default feature engineering and feature selection. The equivalent config.toml parameter is ``enable_genetic_algorithm``." }, { "output": " This is set to Auto by default. Choose from the following:\n\n\t- auto: Choose based upon accuracy and interpretability\n\t- uniform: all individuals in population compete to win as best (can lead to all, e.g. 
LightGBM models in final ensemble, which may not improve ensemble performance due to lack of diversity)\n\t- fullstack: Choose from optimal model and feature types\n\t- feature: individuals with similar feature types compete (good if target encoding, frequency encoding, and other feature sets lead to good results)\n\t- model: individuals with same model type compete (good if multiple models do well but some models that do not do as well still contribute to improving ensemble)\n\n\tFor each case, a round robin approach is used to choose best scores among type of models to choose from." }, { "output": " The tournament is only used to prune-down individuals for, e.g., tuning -> evolution and evolution -> final model. ``make_python_scoring_pipeline``\n\n\n.. dropdown:: Make Python Scoring Pipeline\n\t:open:\n\n\tSpecify whether to automatically build a Python Scoring Pipeline for the experiment. Select On or Auto (default) to make the Python Scoring Pipeline immediately available for download when the experiment is finished. Select Off to disable the automatic creation of the Python Scoring Pipeline. ``make_mojo_scoring_pipeline``\n\n\n.. dropdown:: Make MOJO Scoring Pipeline\n\t:open:\n\n\tSpecify whether to automatically build a MOJO (Java) Scoring Pipeline for the experiment." }, { "output": " With this option, any capabilities that prevent the creation of the pipeline are dropped. Select Off to disable the automatic creation of the MOJO Scoring Pipeline. Select Auto (default) to attempt to create the MOJO Scoring Pipeline without dropping any capabilities. ``mojo_for_predictions``\n\n\n.. dropdown:: Allow Use of MOJO for Making Predictions\n\t:open:\n\n\tSpecify whether to use MOJO for making fast, low-latency predictions after the experiment has finished. When this is set to Auto (default), the MOJO is only used if the number of rows is equal to or below the value specified by ``mojo_for_predictions_max_rows``." }, { "output": " A smaller MOJO leads to less memory footprint during scoring. This setting attempts to reduce the mojo size by limiting experiment's maximum :ref:`interaction depth ` to 3, setting :ref:`ensemble level ` to 0 i.e no ensemble model for final pipeline and limiting the :ref:`maximum number of features ` in the model to 200. Note that these settings in some cases can affect the overall model's predictive accuracy as it is limiting the complexity of the feature engineering and model building space." }, { "output": " The equivalent config.toml setting is ``reduce_mojo_size``\n\n``make_pipeline_visualization``\n\n\n.. dropdown:: Make Pipeline Visualization\n\t:open:\n\n\tSpecify whether to create a visualization of the scoring pipeline at the end of an experiment. This is set to Auto by default. Note that the Visualize Scoring Pipeline feature is experimental and is not available for deprecated models. Visualizations are available for all newly created experiments. ``benchmark_mojo_latency``\n\n\n.. dropdown:: Measure MOJO Scoring Latency\n\t:open:\n\n\tSpecify whether to measure the MOJO scoring latency at the time of MOJO creation." }, { "output": " In this case, MOJO scoring latency will be measured if the pipeline.mojo file size is less than 100 MB. ``mojo_building_timeout``\n~\n\n.. dropdown:: Timeout in Seconds to Wait for MOJO Creation at End of Experiment\n\t:open:\n\n\tSpecify the amount of time in seconds to wait for MOJO creation at the end of an experiment. 
If the MOJO creation process times out, a MOJO can still be made from the GUI or the R and Python clients (the timeout constraint is not applied to these). This value defaults to 1800 sec (30 minutes)." }, { "output": " Higher values can speed up MOJO creation but use more memory. Set this value to -1 (default) to use all physical cores. ``kaggle_username``\n~\n\n.. dropdown:: Kaggle Username\n\t:open:\n\n\tOptionally specify your Kaggle username to enable automatic submission and scoring of test set predictions. If this option is specified, then you must also specify a value for the Kaggle Key option. If you don't have a Kaggle account, you can sign up at https://www.kaggle.com. ``kaggle_key``\n\n\n.. dropdown:: Kaggle Key\n\t:open:\n\n\tSpecify your Kaggle API key to enable automatic submission and scoring of test set predictions." }, { "output": " For more information on obtaining Kaggle API credentials, see https://github.com/Kaggle/kaggle-api#api-credentials. ``kaggle_timeout``\n\n\n.. dropdown:: Kaggle Submission Timeout in Seconds\n\t:open:\n\n\tSpecify the Kaggle submission timeout in seconds. This value defaults to 120 sec. ``min_num_rows``\n\n\n.. dropdown:: Min Number of Rows Needed to Run an Experiment\n\t:open:\n\n\tSpecify the minimum number of rows that a dataset must contain in order to run an experiment. This value defaults to 100. .. _reproducibility_level:\n\n``reproducibility_level``\n~\n\n.. dropdown:: Reproducibility Level\n\t:open:\n\n\tSpecify one of the following levels of reproducibility." }, { "output": " ``seed``\n\n\n.. dropdown:: Random Seed\n\t:open:\n\n\tSpecify a random seed for the experiment. When a seed is defined and the reproducible button is enabled (not by default), the algorithm will behave deterministically. ``allow_different_classes_across_fold_splits``\n\n\n.. dropdown:: Allow Different Sets of Classes Across All Train/Validation Fold Splits\n\t:open:\n\n\t(Note: Applicable for multiclass problems only.) Specify whether to enable full cross-validation (multiple folds) during feature evolution as opposed to a single holdout split." }, { "output": " ``save_validation_splits``\n\n\n.. dropdown:: Store Internal Validation Split Row Indices\n\t:open:\n\n\tSpecify whether to store internal validation split row indices. This includes pickles of (train_idx, valid_idx) tuples (numpy row indices for original training data) for all internal validation folds in the experiment summary ZIP file. Enable this setting for debugging purposes. This setting is disabled by default. ``max_num_classes``\n~\n\n.. dropdown:: Max Number of Classes for Classification Problems\n\t:open:\n\n\tSpecify the maximum number of classes to allow for a classification problem." }, { "output": " Memory requirements also increase with a higher number of classes. This value defaults to 200. ``max_num_classes_compute_roc``\n~\n\n.. dropdown:: Max Number of Classes to Compute ROC and Confusion Matrix for Classification Problems\n\n\tSpecify the maximum number of classes to use when computing the ROC and CM. When this value is exceeded, the reduction type specified by ``roc_reduce_type`` is applied. This value defaults to 200 and cannot be lower than 2. ``max_num_classes_client_and_gui``\n\n\n.. dropdown:: Max Number of Classes to Show in GUI for Confusion Matrix\n\t:open:\n\n\tSpecify the maximum number of classes to show in the GUI for CM, showing first ``max_num_classes_client_and_gui`` labels." 
}, { "output": " Note that if this value is changed in the config.toml and the server is restarted, then this setting will only modify client-GUI launched diagnostics. To control experiment plots, this value must be changed in the expert settings panel. ``roc_reduce_type``\n~\n\n.. dropdown:: ROC/CM Reduction Technique for Large Class Counts\n\t:open:\n\n\tSpecify the ROC confusion matrix reduction technique used for large class counts:\n\n\t- Rows (Default): Reduce by randomly sampling rows\n\t- Classes: Reduce by truncating classes to no more than the value specified by ``max_num_classes_compute_roc``\n\n``max_rows_cm_ga``\n\n\n.. dropdown:: Maximum Number of Rows to Obtain Confusion Matrix Related Plots During Feature Evolution\n\t:open:\n\n\tSpecify the maximum number of rows to obtain confusion matrix related plots during feature evolution." }, { "output": " ``use_feature_brain_new_experiments``\n~\n\n.. dropdown:: Whether to Use Feature Brain for New Experiments\n\t:open:\n\n\tSpecify whether to use feature_brain results even if running new experiments. Feature brain can be risky with some types of changes to experiment setup. Even rescoring may be insufficient, so by default this is False. For example, one experiment may have training=external validation by accident, and get high score, and while feature_brain_reset_score='on' means we will rescore, it will have already seen during training the external validation and leak that data as part of what it learned from." }, { "output": " .. _feature_brain1:\n\n``feature_brain_level``\n~\n\n.. dropdown:: Model/Feature Brain Level\n\t:open:\n\n\tSpecify whether to use H2O.ai brain, which enables local caching and smart re-use (checkpointing) of prior experiments to generate useful features and models for new experiments. It can also be used to control checkpointing for experiments that have been paused or interrupted. When enabled, this will use the H2O.ai brain cache if the cache file:\n\n\t - has any matching column names and types for a similar experiment type\n\t - has classes that match exactly\n\t - has class labels that match exactly\n\t - has basic time series choices that match\n\t - the interpretability of the cache is equal or lower\n\t - the main model (booster) is allowed by the new experiment\n\n\t- -1: Don't use any brain cache (default)\n\t- 0: Don't use any brain cache but still write to cache." }, { "output": " - 1: Smart checkpoint from the latest best individual model. Use case: Want to use the latest matching model. The match may not be precise, so use with caution. - 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time series options identically. Use case: Driverless AI scans through the H2O.ai brain cache for the best models to restart from. - 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size." }, { "output": " - 4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient size. Note that this will re-score the entire population in a single iteration, so it appears to take longer to complete first iteration. - 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations to get the best scored individuals. Note that this can be slower due to brain cache scanning if the cache is large. When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain." 
}, { "output": " Both the directory and the maximum size can be changed in the config.toml file. This value defaults to 2. .. _feature_brain2:\n\n``feature_brain2``\n\n\n.. dropdown:: Feature Brain Save Every Which Iteration\n\t:open:\n\n\tSave feature brain iterations every iter_num % feature_brain_iterations_save_every_iteration == 0, to be able to restart/refit with which_iteration_brain >= 0. This is disabled (0) by default. - -1: Don't use any brain cache. - 0: Don't use any brain cache but still write to cache. - 1: Smart checkpoint if an old experiment_id is passed in (for example, via running \"resume one like this\" in the GUI)." }, { "output": " - 2: Smart checkpoint if the experiment matches all column names, column types, classes, class labels, and time series options identically. (default)\n\t- 3: Smart checkpoint like level #1 but for the entire population. Tune only if the brain population is of insufficient size. - 4: Smart checkpoint like level #2 but for the entire population. Tune only if the brain population is of insufficient size. - 5: Smart checkpoint like level #4 but will scan over the entire brain cache of populations (starting from resumed experiment if chosen) in order to get the best scored individuals. When enabled, the directory where the H2O.ai Brain meta model files are stored is H2O.ai_brain." }, { "output": " Both the directory and the maximum size can be changed in the config.toml file. .. _feature_brain3:\n\n``feature_brain3``\n\n.. dropdown:: Feature Brain Restart from Which Iteration\n\t:open:\n\n\tWhen performing restart or re-fit of type feature_brain_level with a resumed ID, specify which iteration to start from instead of only the last best (-1 means use the last best). Usage:\n\n\t- 1) Run one experiment with feature_brain_iterations_save_every_iteration=1 or some other number\n\t- 2) Identify which iteration brain dump you want to restart/refit from\n\t- 3) Restart/Refit from the original experiment, setting which_iteration_brain to that number here in expert settings." }, { "output": " This value defaults to -1. .. _feature_brain4:\n\n``feature_brain4``\n\n\n.. dropdown:: Feature Brain Refit Uses Same Best Individual\n\t:open:\n\n\tSpecify whether to use the same best individual when performing a refit. Disabling this setting allows the order of best individuals to be rearranged, leading to a better final result. Enabling this setting lets you view the exact same model or feature with only one new feature added. This is disabled by default. .. _feature_brain5:\n\n``feature_brain5``\n\n\n.. dropdown:: Feature Brain Adds Features with New Columns Even During Retraining of Final Model\n\t:open:\n\n\tSpecify whether to add additional features from new columns to the pipeline, even when performing a retrain of the final model." }, { "output": " New data may lead to new dropped features due to shift or leak detection. Disable this to avoid adding any columns as new features so that the pipeline is perfectly preserved when changing data. This is enabled by default. ``force_model_restart_to_defaults``\n~\n\n.. dropdown:: Restart-Refit Use Default Model Settings If Model Switches\n\t:open:\n\n\tWhen restarting or refitting, specify whether to use the model class's default settings if the original model class is no longer available. If this is disabled, the original hyperparameters will be used instead." }, { "output": " This is enabled by default. ``min_dai_iterations``\n\n\n.. dropdown:: Min DAI Iterations\n\t:open:\n\n\tSpecify the minimum number of Driverless AI iterations for an experiment. 
This can be used during restarting, when you want to continue for longer despite a score not improving. This value defaults to 0. .. _target_transformer:\n\n``target_transformer``\n\n\n.. dropdown:: Select Target Transformation of the Target for Regression Problems\n\t:open:\n\n\tSpecify whether to automatically select target transformation for regression problems." }, { "output": " Selecting identity_noclip automatically turns off any target transformations. All transformers except for center, standardize, identity_noclip and log_noclip perform clipping to constrain the predictions to the domain of the target in the training data, so avoid them if you want to enable extrapolations. The equivalent config.toml setting is ``target_transformer``. ``fixed_num_folds_evolution``\n~\n\n.. dropdown:: Number of Cross-Validation Folds for Feature Evolution\n\t:open:\n\n\tSpecify the fixed number of cross-validation folds (if >= 2) for feature evolution." }, { "output": " This value defaults to -1 (auto). ``fixed_num_folds``\n~\n\n.. dropdown:: Number of Cross-Validation Folds for Final Model\n\t:open:\n\n\tSpecify the fixed number of cross-validation folds (if >= 2) for the final model. Note that the actual number of allowed folds can be less than the specified value, and that the number of allowed folds is determined at the time an experiment is run. This value defaults to -1 (auto). ``fixed_only_first_fold_model``\n~\n\n.. dropdown:: Force Only First Fold for Models\n\t:open:\n\n\tSpecify whether to force only the first fold for models." }, { "output": " Set \"on\" to force only first fold for models.This is useful for quick runs regardless of data\n\n``feature_evolution_data_size``\n~\n\n.. dropdown:: Max Number of Rows Times Number of Columns for Feature Evolution Data Splits\n\t:open:\n\n\tSpecify the maximum number of rows allowed for feature evolution data splits (not for the final pipeline). This value defaults to 100,000,000. ``final_pipeline_data_size``\n\n\n.. dropdown:: Max Number of Rows Times Number of Columns for Reducing Training Dataset\n\t:open:\n\n\tSpecify the upper limit on the number of rows times the number of columns for training the final pipeline." }, { "output": " ``max_validation_to_training_size_ratio_for_final_ensemble``\n\n\n.. dropdown:: Maximum Size of Validation Data Relative to Training Data\n\t:open:\n\n\tSpecify the maximum size of the validation data relative to the training data. Smaller values can make the final pipeline model training process quicker. Note that final model predictions and scores will always be provided on the full dataset provided. This value defaults to 2.0. ``force_stratified_splits_for_imbalanced_threshold_binary``\n~\n\n.. dropdown:: Perform Stratified Sampling for Binary Classification If the Target Is More Imbalanced Than This\n\t:open:\n\n\tFor binary classification experiments, specify a threshold ratio of minority to majority class for the target column beyond which stratified sampling is performed." }, { "output": " This value defaults to 0.01. You can choose to always perform random sampling by setting this value to 0, or to always perform stratified sampling by setting this value to 1. .. _config_overrides:\n\n``config_overrides``\n\n\n.. dropdown:: Add to config.toml via TOML String\n\t:open:\n\n\tSpecify any additional configuration overrides from the config.toml file that you want to include in the experiment. 
(Refer to the :ref:`sample-configtoml` section to view options that can be overridden during an experiment.)" }, { "output": " Separate multiple config overrides with ``\\n``. For example, the following enables Poisson distribution for LightGBM and disables Target Transformer Tuning. Note that in this example double quotes are escaped (``\\\" \\\"``). ::\n\n\t params_lightgbm=\\\"{'objective':'poisson'}\\\" \\n target_transformer=identity\n\n\tOr you can specify config overrides similar to the following without having to escape double quotes:\n\n\t::\n\n\t \"\"enable_glm=\"off\" \\n enable_xgboost_gbm=\"off\" \\n enable_lightgbm=\"off\" \\n enable_tensorflow=\"on\"\"\"\n\t \"\"max_cores=10 \\n data_precision=\"float32\" \\n max_rows_feature_evolution=50000000000 \\n ensemble_accuracy_switch=11 \\n feature_engineering_effort=1 \\n target_transformer=\"identity\" \\n tournament_feature_style_accuracy_switch=5 \\n params_tensorflow=\"{'layers': [100, 100, 100, 100, 100, 100]}\"\"\"\n\n\tWhen running the Python client, config overrides would be set as follows:\n\n\t::\n\n\t\tmodel = h2o.start_experiment_sync(\n\t\t dataset_key=train.key,\n\t\t target_col='target',\n\t\t is_classification=True,\n\t\t accuracy=7,\n\t\t time=5,\n\t\t interpretability=1,\n\t\t config_overrides=\"\"\"\n\t\t feature_brain_level=0\n\t\t enable_lightgbm=\"off\"\n\t\t enable_xgboost_gbm=\"off\"\n\t\t enable_ftrl=\"off\"\n\t\t \"\"\"\n\t\t)\n\n``last_recipe``\n~\n\n.. dropdown:: last_recipe\n\t:open:\n\n\tInternal helper to allow memory of if changed recipe\n\n``feature_brain_reset_score``\n~\n\n.. dropdown:: Whether to re-score models from brain cache\n\t:open:\n\n\tSpecify whether to smartly keep score to avoid re-munging/re-training/re-scoring steps brain models ('auto'), always force all steps for all brain imports ('on'), or never rescore ('off')." }, { "output": " 'on' is useful when smart similarity checking is not reliable enough. 'off' is useful when know want to keep exact same features and model for final model refit, despite changes in seed or other behaviors in features that might change the outcome if re-scored before reaching final model. If set off, then no limits are applied to features during brain ingestion, while can set brain_add_features_for_new_columns to false if want to ignore any new columns in data. Can also set refit_same_best_individual True if want exact same best individual (highest scored model+features) to be used regardless of any scoring changes." }, { "output": " Set to 0 to disable this setting. ``which_iteration_brain``\n~\n\n.. dropdown:: Feature Brain Restart from which iteration\n\t:open:\n\n\tWhen performing restart or re-fit type feature_brain_level with resumed_experiment_id, choose which iteration to start from, instead of only last best -1 means just use last best. Usage:\n\n - 1) Run one experiment with feature_brain_iterations_save_every_iteration=1 or some other number\n - 2) Identify which iteration brain dump one wants to restart/refit from\n - 3) Restart/Refit from original experiment, setting which_iteration_brain to that number in expert settings\n\n\tNote: If restart from a tuning iteration, this will pull in entire scored tuning population and use that for feature evolution." }, { "output": " But sometimes you want to see exact same model/features with only one feature added, and then would need to set this to True case. 
That is, if you refit with just one extra column and have interpretability=1, the final model will have the same features, with one more engineered feature applied to that new original column. ``restart_refit_redo_origfs_shift_leak``\n\n\n.. dropdown:: For restart-refit, select which steps to do\n\t:open:\n\n\tWhen doing a restart or re-fit of an experiment from the feature brain, sometimes a user might change the data significantly, warranting a redo of the reduction of original features by feature selection, shift detection, and leakage detection." }, { "output": " These steps can also vary from run to run (e.g., due to the random seed when reproducible mode is not set), leading to changes in the features and model that are refitted. By default, restart and refit avoid these steps, assuming the data and experiment setup have not changed significantly. If check_distribution_shift is forced to on (instead of auto), then this option is ignored. In order to ensure the exact same final pipeline is fitted, one should also set:\n\n\t- 1) brain_add_features_for_new_columns false\n\t- 2) refit_same_best_individual true\n\t- 3) feature_brain_reset_score 'off'\n\t- 4) force_model_restart_to_defaults false\n\n\tThe score will still be reset if the chosen experiment metric changes, but changes to the scored model and features will be more frozen in place." }, { "output": " In some cases, one might have a new dataset but only want to keep the same pipeline regardless of new columns, in which case one sets this to False. For example, new data might lead to new dropped features, due to shift or leak detection. To avoid a change of feature set, one can disable all dropping of columns, but set this to False to avoid adding any columns as new features, so the pipeline is perfectly preserved when changing data. ``force_model_restart_to_defaults``\n\n\n.. dropdown:: Restart-refit use default model settings if model switches\n\t:open:\n\n\tIf on restart/refit the original model class is no longer available, be conservative and go back to defaults for that model class." }, { "output": " ``dump_modelparams_every_scored_indiv``\n~\n\n.. dropdown:: Enable detailed scored model info\n\t:open:\n\n\tSpecify whether to dump every scored individual's model parameters to a csv/tabulated/json file. This produces files such as individual_scored.params.[txt, csv, json]\n\n.. _fast-approx-trees:\n\n``fast_approx_num_trees``\n~\n\n.. dropdown:: Max number of trees to use for fast approximation\n\t:open:\n\n\tWhen ``fast_approx=True``, specify the maximum number of trees to use. By default, this value is 250. .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions." }, { "output": " By default, this setting is enabled. .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions. .. _fast-approx-one-model:\n\n``fast_approx_do_one_model``\n\n\n.. dropdown:: Whether to use only one model for fast approximation\n\t:open:\n\n\tWhen ``fast_approx=True``, specify whether to speed up fast approximation further by using only one model out of all ensemble models. By default, this setting is disabled. .. note::\n By default, ``fast_approx`` is enabled for MLI and AutoDoc and disabled for Experiment predictions." }, { "output": " By default, this value is 50. .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc. .. _fast-approx-one-fold-shap:\n\n``fast_approx_contribs_do_one_fold``\n\n\n.. 
dropdown:: Whether to use only one fold for fast approximation when making Shapley predictions\n\t:open:\n\n\tWhen ``fast_approx_contribs=True``, specify whether to speed up ``fast_approx_contribs`` further by using only one fold out of all cross-validation folds for 'Fast Approximation' in the GUI when making Shapley predictions and for AutoDoc/MLI." }, { "output": " .. note::\n By default, ``fast_approx_contribs`` is enabled for MLI and AutoDoc. .. _fast-approx-one-model-shap:\n\n``fast_approx_contribs_do_one_model``\n~\n\n.. dropdown:: Whether to use only one model for fast approximation when making Shapley predictions\n\t:open:\n\n\tWhen ``fast_approx_contribs=True``, specify whether to speed up ``fast_approx_contribs`` further by using only one model out of all ensemble models for 'Fast Approximation' in the GUI when making Shapley predictions and for AutoDoc/MLI." }, { "output": " .. _linux-rpms:\n\nLinux RPMs\n\n\nFor Linux machines that will not use the Docker image or DEB, an RPM installation is available for the following environments:\n\n- x86_64 RHEL 7 / RHEL 8\n- CentOS 7 / CentOS 8\n\nThe installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the license folder that you will create during the installation process." }, { "output": " - When using systemd, remove the ``dai-minio``, ``dai-h2o``, ``dai-redis``, ``dai-procsy``, and ``dai-vis-server`` services. When upgrading, you can use the following commands to deactivate these services:\n\n ::\n\n systemctl stop dai-minio\n systemctl disable dai-minio\n systemctl stop dai-h2o\n systemctl disable dai-h2o\n systemctl stop dai-redis\n systemctl disable dai-redis\n systemctl stop dai-procsy\n systemctl disable dai-procsy\n systemctl stop dai-vis-server\n systemctl disable dai-vis-server\n\nEnvironment\n~\n\n+------------------+---------+\n| Operating System | Min Mem |\n+==================+=========+\n| RHEL with GPUs   | 64 GB   |\n+------------------+---------+\n| RHEL with CPUs   | 64 GB   |\n+------------------+---------+\n| CentOS with GPUs | 64 GB   |\n+------------------+---------+\n| CentOS with CPUs | 64 GB   |\n+------------------+---------+\n\nRequirements\n\n\n- RedHat 7/RedHat 8/CentOS 7/CentOS 8\n- NVIDIA drivers >= |NVIDIA-driver-ver| recommended (GPU only)." }, { "output": " About the Install\n~\n\n.. include:: linux-rpmdeb-about.frag\n\nInstalling OpenCL\n~\n\nOpenCL is required for full LightGBM support on GPU-powered systems. To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used. CUDA LightGBM is only supported on Pascal-powered (and later) systems, and can be enabled manually with the ``enable_lightgbm_cuda_support`` config.toml setting." }, { "output": " .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI.\n sudo rpm -i |VERSION-rpm-lin|\n\n\nNote: For RHEL 7.5, it is necessary to upgrade library glib2:\n\n.. code-block:: bash\n\n sudo yum upgrade glib2\n\nBy default, the Driverless AI processes are owned by the 'dai' user and 'dai' group. You can optionally specify a different service user and group as shown below. Replace ``myuser`` and ``mygroup`` as appropriate. .. 
code-block:: bash\n :substitutions:\n\n # Temporarily specify service user and group when installing Driverless AI." }, { "output": " sudo DAI_USER=myuser DAI_GROUP=mygroup rpm -i |VERSION-rpm-lin|\n\nYou may now optionally make changes to /etc/dai/config.toml. Starting Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Start Driverless AI. sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Start Driverless AI. sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nStarting NVIDIA Persistence Mode\n\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every reboot." }, { "output": " .. include:: enable-persistence.rst\n\nLooking at Driverless AI log files\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n sudo systemctl status dai-dai\n sudo journalctl -u dai-dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n sudo less /opt/h2oai/dai/log/dai.log\n sudo less /opt/h2oai/dai/log/h2o.log\n sudo less /opt/h2oai/dai/log/procsy.log\n sudo less /opt/h2oai/dai/log/vis-server.log\n\nStopping Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\nUpgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere. Driverless AI ships with CUDA 11.2.2 for GPUs, but the driver must exist in the host environment." }, { "output": " For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__ . .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo systemctl stop dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " sudo rpm -U |VERSION-rpm-lin|\n sudo systemctl daemon-reload\n sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time. # Upgrade and restart. sudo rpm -U |VERSION-rpm-lin|\n sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nUninstalling Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall. sudo rpm -e dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Uninstall. sudo rpm -e dai\n\nCAUTION! At this point you can optionally completely remove all remaining files, including the database. (This cannot be undone.) .. code-block:: bash\n\n sudo rm -rf /opt/h2oai/dai\n sudo rm -rf /etc/dai\n\nNote: The UID and GID are not removed during the uninstall process." }, { "output": " .. 
_linux-deb:\n\nLinux DEBs\n\n\nFor Linux machines that will not use the Docker image or RPM, a deb installation is available for x86_64 Ubuntu 16.04/18.04/20.04/22.04. The following installation steps assume that you have a valid license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://www.h2o.ai/products/h2o-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " - When using systemd, remove the ``dai-minio``, ``dai-h2o``, ``dai-redis``, ``dai-procsy``, and ``dai-vis-server`` services. When upgrading, you can use the following commands to deactivate these services:\n\n ::\n\n systemctl stop dai-minio\n systemctl disable dai-minio\n systemctl stop dai-h2o\n systemctl disable dai-h2o\n systemctl stop dai-redis\n systemctl disable dai-redis\n systemctl stop dai-procsy\n systemctl disable dai-procsy\n systemctl stop dai-vis-server\n systemctl disable dai-vis-server\n\nEnvironment\n~\n\n+-+-+\n| Operating System | Min Mem |\n+=+=+\n| Ubuntu with GPUs | 64 GB |\n+-+-+\n| Ubuntu with CPUs | 64 GB |\n+-+-+\n\nRequirements\n\n\n- Ubuntu 16.04/Ubuntu 18.04/Ubuntu 20.04/Ubuntu 22.04\n- NVIDIA drivers >= |NVIDIA-driver-ver| is recommended (GPU only)." }, { "output": " About the Install\n~\n\n.. include:: linux-rpmdeb-about.frag\n\nStarting NVIDIA Persistence Mode (GPU only)\n~\n\nIf you have NVIDIA GPUs, you must run the following NVIDIA command. This command needs to be run every reboot. For more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\nInstalling OpenCL\n~\n\nOpenCL is required for full LightGBM support on GPU-powered systems. To install OpenCL, run the following as root:\n\n.. code-block:: bash\n\n mkdir -p /etc/OpenCL/vendors && echo \"libnvidia-opencl.so.1\" > /etc/OpenCL/vendors/nvidia.icd && chmod a+r /etc/OpenCL/vendors/nvidia.icd && chmod a+x /etc/OpenCL/vendors/ && chmod a+x /etc/OpenCL\n\n.. note::\n\tIf OpenCL is not installed, then CUDA LightGBM is automatically used." }, { "output": " Installing the Driverless AI Linux DEB\n\n\nRun the following commands to install the Driverless AI DEB. .. code-block:: bash\n :substitutions:\n\n # Install Driverless AI. sudo dpkg -i |VERSION-deb-lin|\n\nBy default, the Driverless AI processes are owned by the 'dai' user and 'dai' group. You can optionally specify a different service user and group as shown below. Replace and as appropriate. .. code-block:: bash\n :substitutions:\n\n # Temporarily specify service user and group when installing Driverless AI." }, { "output": " sudo DAI_USER=myuser DAI_GROUP=mygroup dpkg -i |VERSION-deb-lin|\n\nYou may now optionally make changes to /etc/dai/config.toml. Starting Driverless AI\n\n\nTo start Driverless AI, use the following command:\n\n.. code-block:: bash\n\n # Start Driverless AI. sudo systemctl start dai\n\nNote: If you don't have systemd, refer to :ref:`linux-tarsh` for install instructions. Viewing Driverless AI Log Files\n~\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n sudo systemctl status dai-dai\n sudo journalctl -u dai-dai\n\nIf you do not have systemd:\n\n.. 
code-block:: bash\n\n sudo less /opt/h2oai/dai/log/dai.log\n sudo less /opt/h2oai/dai/log/h2o.log\n sudo less /opt/h2oai/dai/log/procsy.log\n sudo less /opt/h2oai/dai/log/vis-server.log\n\nStopping Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n\nUpgrading Driverless AI\n~\n\n.. include:: upgrade-warning.frag\n\nRequirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere. Driverless AI ships with CUDA 11.2.2 for GPUs, but the driver must exist in the host environment." }, { "output": " For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__ . .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo systemctl stop dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time." }, { "output": " sudo dpkg -i |VERSION-deb-lin|\n sudo systemctl daemon-reload\n sudo systemctl start dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n :substitutions:\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Make a backup of /opt/h2oai/dai/tmp directory at this time. If you do not, all previous data will be lost. # Upgrade and restart. sudo dpkg -i |VERSION-deb-lin|\n sudo -H -u dai /opt/h2oai/dai/run-dai.sh\n\nUninstalling Driverless AI\n\n\nIf you have systemd (preferred):\n\n.. code-block:: bash\n\n # Stop Driverless AI." }, { "output": " Verify. sudo ps -u dai\n\n # Uninstall Driverless AI. sudo dpkg -r dai\n\n # Purge Driverless AI. sudo dpkg -P dai\n\nIf you do not have systemd:\n\n.. code-block:: bash\n\n # Stop Driverless AI. sudo pkill -U dai\n\n # The processes should now be stopped. Verify. sudo ps -u dai\n\n # Uninstall Driverless AI. sudo dpkg -r dai\n\n # Purge Driverless AI. sudo dpkg -P dai\n\nCAUTION! At this point you can optionally completely remove all remaining files, including the database (this cannot be undone):\n\n.. code-block:: bash\n\n sudo rm -rf /opt/h2oai/dai\n sudo rm -rf /etc/dai\n\nNote: The UID and GID are not removed during the uninstall process." }, { "output": " However, we DO NOT recommend removing the UID and GID if you plan to re-install Driverless AI. If you remove the UID and GID and then reinstall Driverless AI, the UID and GID will likely be re-assigned to a different (unrelated) user/group in the future; this may cause confusion if there are any remaining files on the filesystem referring to the deleted user or group. Common Problems\n~\n\nStart of Driverless AI fails on the message ``Segmentation fault (core dumped)`` on Ubuntu 18. This problem is caused by the font ``NotoColorEmoji.ttf``, which cannot be processed by the Python matplotlib library." }, { "output": " .. _install-on-nvidia-dgx:\n\nInstall on NVIDIA GPU Cloud/NGC Registry\n\n\nDriverless AI is supported on the following NVIDIA DGX products, and the installation steps for each platform are the same. 
- `NVIDIA GPU Cloud `__\n- `NVIDIA DGX-1 `__\n- `NVIDIA DGX-2 `__\n- `NVIDIA DGX Station `__\n\nEnvironment\n~\n\n+++++\n| Provider | GPUs | Min Memory | Suitable for |\n+++++\n| NVIDIA GPU Cloud | Yes | | Serious use |\n+++++\n| NVIDIA DGX-1/DGX-2 | Yes | 128 GB | Serious use |\n+++++\n| NVIDIA DGX Station | Yes | 64 GB | Serious Use | \n+++++\n\nInstalling the NVIDIA NGC Registry\n\n\nNote: These installation instructions assume that you are running on an NVIDIA DGX machine." }, { "output": " 1. Log in to your NVIDIA GPU Cloud account at https://ngc.nvidia.com/registry. (Note that NVIDIA Compute is no longer supported by NVIDIA.) 2. In the Registry > Partners menu, select h2oai-driverless. .. image:: ../images/ngc_select_dai.png\n :align: center\n\n3. At the bottom of the screen, select one of the H2O Driverless AI tags to retrieve the pull command. .. image:: ../images/ngc_select_tag.png\n :align: center\n\n4. On your NVIDIA DGX machine, open a command prompt and use the specified pull command to retrieve the Driverless AI image." }, { "output": " Set up a directory for the version of Driverless AI on the host machine: \n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n6. Set up the data, log, license, and tmp directories on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the directory associated with the selected version of Driverless AI\n cd |VERSION-dir|\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n7." }, { "output": " The data will be visible inside the Docker container. 8. Enable persistence of the GPU. Note that this only needs to be run once. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\n9. Run ``docker images`` to find the new image tag. 10. Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command." }, { "output": " We recommend ``shm-size=256m`` in docker launch command. But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command. Note: Use ``docker version`` to check which version of Docker you are using. .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n11." 
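}, { "output": " As an optional sanity check (assuming the launch command above, with the UI published on port 12345), confirm from the host that the container is up and responding:\n\n .. code-block:: bash\n\n # List running containers and their port mappings\n docker ps\n\n # The UI should answer with an HTTP status line\n curl -sI http://localhost:12345 | head -n 1" 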
}, { "output": " Upgrading Driverless AI\n~\n\nThe steps for upgrading Driverless AI on an NVIDIA DGX system are similar to the installation steps. .. include:: upgrade-warning.frag\n \nNote: Use Ctrl+C to stop Driverless AI if it is still running. Requirements\n\n\nAs of 1.7.0, CUDA 9 is no longer supported. Your host environment must have CUDA 10.0 or later with NVIDIA drivers >= 440.82 installed (GPU only). Driverless AI ships with its own CUDA libraries, but the driver must exist in the host environment. Go to https://www.nvidia.com/Download/index.aspx to get the latest NVIDIA Tesla V/P/K series driver." }, { "output": " AWS Role-Based Authentication\n~\n\nIn Driverless AI, it is possible to enable role-based authentication via the `IAM role `__. This is a two-step process that involves setting up AWS IAM and then starting Driverless AI by specifying the role in the config.toml file or by setting the ``AWS_USE_EC2_ROLE_CREDENTIALS`` environment variable to ``True``. AWS IAM Setup\n'\n\n1. Create an IAM role. This IAM role should have a Trust Relationship with Principal Trust Entity set to your Account ID." }, { "output": " Create a new policy that lets users assume the role:\n\n .. image:: ../images/aws_iam_policy_create.png\n\n3. Assign the policy to the user. .. image:: ../images/aws_iam_policy_assign.png\n\n4. Test role switching here: https://signin.aws.amazon.com/switchrole. (Refer to https://docs.aws.amazon.com/IAM/latest/UserGuide/troubleshoot_roles.html#troubleshoot_roles_cant-assume-role.) Driverless AI Setup\n'\n\nUpdate the ``aws_use_ec2_role_credentials`` config variable in the config.toml file or start Driverless AI using the ``AWS_USE_EC2_ROLE_CREDENTIALS`` environment variable." }, { "output": " .. _system-settings:\n\nSystem Settings\n=\n\n.. _exclusive_mode:\n\n``exclusive_mode``\n\n\n.. dropdown:: Exclusive level of access to node resources\n\t:open:\n\n\tThere are three levels of access:\n\n\t\t- safe: this level assumes that there might be another experiment also running on same node. - moderate: this level assumes that there are no other experiments or tasks running on the same node, but still only uses physical core counts. - max: this level assumes that there is absolutly nothing else running on the node except the experiment\n\n\tThe default level is \"safe\" and the equivalent config.toml parameter is ``exclusive_mode``." }, { "output": " Each exclusive mode can be chosen, and then fine-tuned using each expert settings. Changing the exclusive mode will reset all exclusive mode related options back to default and then re-apply the specific rules for the new mode, which will undo any fine-tuning of expert options that are part of exclusive mode rules. If you choose to do new/continued/refitted/retrained experiment from parent experiment, all the mode rules are not re-applied and any fine-tuning is preserved. To reset mode behavior, one can switch between 'safe' and the desired mode." }, { "output": " ``max_cores``\n~\n\n.. dropdown:: Number of Cores to Use\n\t:open:\n\n\tSpecify the number of cores to use per experiment. Note that if you specify 0, all available cores will be used. Lower values can reduce memory usage but might slow down the experiment. This value defaults to 0(all). One can also set it using the environment variable OMP_NUM_THREADS or OPENBLAS_NUM_THREADS (e.g., in bash: 'export OMP_NUM_THREADS=32' or 'export OPENBLAS_NUM_THREADS=32')\n\n``max_fit_cores``\n~\n\n.. 
dropdown:: Maximum Number of Cores to Use for Model Fit\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's fit call." }, { "output": " This value defaults to 10. .. _use_dask_cluster:\n\n``use_dask_cluster``\n\n\n.. dropdown:: If full dask cluster is enabled, use full cluster\n\t:open:\n\n\tSpecify whether to use full multinode distributed cluster (True) or single-node dask (False). In some cases, using the entire cluster can be inefficient. For example, several DGX nodes can be more efficient if used one DGX at a time for medium-sized data. The equivalent config.toml parameter is ``use_dask_cluster``. ``max_predict_cores``\n~\n\n.. dropdown:: Maximum Number of Cores to Use for Model Predict\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's predict call." }, { "output": " This value defaults to 0 (all). ``max_predict_cores_in_dai``\n\n\n.. dropdown:: Maximum Number of Cores to Use for Model Transform and Predict When Doing MLI, AutoDoc\n\t:open:\n\n\tSpecify the maximum number of cores to use for a model's transform and predict call when doing operations in the Driverless AI MLI GUI and the Driverless AI R and Python clients. Note that if you specify 0, all available cores will be used. This value defaults to 4. ``batch_cpu_tuning_max_workers``\n\n\n.. dropdown:: Tuning Workers per Batch for CPU\n\t:open:\n\n\tSpecify the number of workers used in CPU mode for tuning." }, { "output": " This value defaults to 0 (socket count). ``cpu_max_workers``\n~\n.. dropdown:: Number of Workers for CPU Training\n\t:open:\n\n\tSpecify the number of workers used in CPU mode for training:\n\n\t- 0: Use socket count (Default)\n\t- -1: Use all physical cores >= 1 that count\n\n.. _num_gpus_per_experiment:\n\n``num_gpus_per_experiment``\n~\n\n.. dropdown:: #GPUs/Experiment\n\t:open:\n\n\tSpecify the number of GPUs to use per experiment. A value of -1 (default) specifies to use all available GPUs. Must be at least as large as the number of GPUs to use per model (or -1)." }, { "output": " ``min_num_cores_per_gpu``\n~\n.. dropdown:: Num Cores/GPU\n\t:open:\n\n\tSpecify the number of CPU cores per GPU. In order to have a sufficient number of cores per GPU, this setting limits the number of GPUs used. This value defaults to 2. .. _num-gpus-per-model:\n\n``num_gpus_per_model``\n\n.. dropdown:: #GPUs/Model\n\t:open:\n\n\tSpecify the number of GPUs to use per model. The equivalent config.toml parameter is ``num_gpus_per_model`` and the default value is 1. Currently, num_gpus_per_model other than 1 disables GPU locking, so it is only recommended for single experiments and single users." }, { "output": " In all cases, XGBoost tree and linear models use the number of GPUs specified per model, while LightGBM and Tensorflow revert to using 1 GPU/model and run multiple models on multiple GPUs. FTRL does not use GPUs. Rulefit uses GPUs for parts involving obtaining the tree using LightGBM. In multinode context when using dask, this parameter refers to the per-node value. .. _num-gpus-for-prediction:\n\n``num_gpus_for_prediction``\n~\n\n.. dropdown:: Num. of GPUs for Isolated Prediction/Transform\n\t:open:\n\n\tSpecify the number of GPUs to use for ``predict`` for models and ``transform`` for transformers when running outside of ``fit``/``fit_transform``." }, { "output": " New processes will use this count for applicable models and transformers. Note that enabling ``tensorflow_nlp_have_gpus_in_production`` will override this setting for relevant TensorFlow NLP transformers. 
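As an illustrative config.toml sketch tying the GPU settings above together (values are examples, not recommendations):\n\n\t::\n\n\t num_gpus_per_experiment = -1 # use all available GPUs\n\t num_gpus_per_model = 1 # keeps global GPU locking enabled\n\t num_gpus_for_prediction = 0 # the default for isolated predict/transform\n\n\t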
The equivalent config.toml parameter is ``num_gpus_for_prediction`` and the default value is \"0\". Note: When GPUs are used, TensorFlow, PyTorch models and transformers, and RAPIDS always predict on GPU. And RAPIDS requires Driverless AI python scoring package also to be used on GPUs. In multinode context when using dask, this refers to the per-node value." }, { "output": " ``gpu_id_start``\n~\n\n.. dropdown:: GPU starting ID\n\t:open:\n\n\tIf using CUDA_VISIBLE_DEVICES=... to control GPUs (preferred method), gpu_id=0 is the\n\tfirst in that restricted list of devices. For example, if ``CUDA_VISIBLE_DEVICES='4,5'`` then ``gpu_id_start=0`` will refer to device #4. From expert mode, to run 2 experiments, each on a distinct GPU out of 2 GPUs, then:\n\n\t- Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=0\n\t- Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=1, gpu_id_start=1\n\n\tFrom expert mode, to run 2 experiments, each on a distinct GPU out of 8 GPUs, then:\n\n\t- Experiment#1: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=0\n\t- Experiment#2: num_gpus_per_model=1, num_gpus_per_experiment=4, gpu_id_start=4\n\n\tTo run on all 4 GPUs/model, then\n\n\t- Experiment#1: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=0\n\t- Experiment#2: num_gpus_per_model=4, num_gpus_per_experiment=4, gpu_id_start=4\n\n\tIf num_gpus_per_model!=1, global GPU locking is disabled." }, { "output": " More information is available at: https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker#gpu-isolation\n\tNote that gpu selection does not wrap, so gpu_id_start + num_gpus_per_model must be less than the number of visible GPUs. ``assumed_simultaneous_dt_forks_munging``\n~\n\n.. dropdown:: Assumed/Expected number of munging forks\n\t:open:\n\n\tExpected maximum number of forks, used to ensure datatable doesn't overload the system. For actual use beyond this value, the system will start to have slow-down issues." }, { "output": " ``max_max_dt_threads_munging``\n\n.. dropdown:: Maximum number of threads for datatable for munging\n\t:open:\n\n\tMaximum number of threads for datatable for munging. ``max_dt_threads_munging``\n\n\n.. dropdown:: Max Number of Threads to Use for datatable and OpenBLAS for Munging and Model Training\n\t:open:\n\n\tSpecify the maximum number of threads to use for datatable and OpenBLAS during data munging (applied on a per process basis):\n\n\t- 0 = Use all threads\n\t- -1 = Automatically select number of threads (Default)\n\n``max_dt_threads_readwrite``\n\n\n.. dropdown:: Max Number of Threads to Use for datatable Read and Write of Files\n\t:open:\n\n\tSpecify the maximum number of threads to use for datatable during data reading and writing (applied on a per process basis):\n\n\t- 0 = Use all threads\n\t- -1 = Automatically select number of threads (Default)\n\n``max_dt_threads_stats_openblas``\n~\n\n.. dropdown:: Max Number of Threads to Use for datatable Stats and OpenBLAS\n\t:open:\n\n\tSpecify the maximum number of threads to use for datatable stats and OpenBLAS (applied on a per process basis):\n\n\t- 0 = Use all threads\n\t- -1 = Automatically select number of threads (Default)\n\n.. _allow_reduce_features_when_failure:\n\n``allow_reduce_features_when_failure``\n\n\n.. dropdown:: Whether to reduce features when model fails (GPU OOM Protection)\n\t:open:\n\n\tBig models (on big data or with a lot of features) can run out of memory on GPUs." }, { "output": " It is currently applicable to all non-dask XGBoost models (i.e. 
GLMModel, XGBoostGBMModel, XGBoostDartModel, XGBoostRFModel), during normal fit or when using Optuna. This is achieved by reducing features until the model no longer fails. For example, if XGBoost runs out of GPU memory, this is detected, and (regardless of the setting of skip_model_failures), we perform feature selection using XGBoost on subsets of features. The dataset is progressively reduced by a factor of 2 with more models to cover all features." }, { "output": " Then all sub-models are used to estimate variable importance by absolute information gain, in order to decide which features to include. Finally, a single model with the most important features is built using the feature count that did not lead to OOM. Note:\n\n\t- This option is set to 'auto' -> 'on' by default, i.e., whenever the conditions are favorable, it is set to 'on'.\n\t- Reproducibility is not guaranteed when this option is turned on. Hence, if the user enables reproducibility for the experiment, 'auto' automatically sets this option to 'off'." }, { "output": " - Reduction is only done on features and not on rows for the feature selection step. Also see :ref:`reduce_repeats_when_failure ` and :ref:`fraction_anchor_reduce_features_when_failure `\n\n.. _reduce_repeats_when_failure:\n\n``reduce_repeats_when_failure``\n~\n\n.. dropdown:: Number of repeats for models used for feature selection during failure recovery\n\t:open:\n\n\tWith :ref:`allow_reduce_features_when_failure `, this controls how many repeats of sub-models are used for feature selection." }, { "output": " More repeats can lead to higher accuracy. The cost of this option is proportional to the repeat count. The default value is 1. .. _fraction_anchor_reduce_features_when_failure:\n\n``fraction_anchor_reduce_features_when_failure``\n\n\n.. dropdown:: Fraction of features treated as anchor for feature selection during failure recovery\n\t:open:\n\n\tWith :ref:`allow_reduce_features_when_failure `, this controls the fraction of features treated as an anchor that are fixed for all sub-models." }, { "output": " For tuning and evolution, the probability depends upon any prior importance (if present) from other individuals, while the final model uses uniform probability for anchor features. The default fraction is 0.1. ``xgboost_reduce_on_errors_list``\n~\n\n.. dropdown:: Errors From XGBoost That Trigger Reduction of Features\n\t:open:\n\n\tError strings from XGBoost that are used to trigger re-fit on reduced sub-models. See allow_reduce_features_when_failure. ``lightgbm_reduce_on_errors_list``\n\n\n.. dropdown:: Errors From LightGBM That Trigger Reduction of Features\n\t:open:\n\n\tError strings from LightGBM that are used to trigger re-fit on reduced sub-models." }, { "output": " ``num_gpus_per_hyperopt_dask``\n\n\n.. dropdown:: GPUs / HyperOptDask\n\t:open:\n\n\tSpecify the number of GPUs to use per model hyperopt training task. To use all GPUs, set this to -1. For example, when this is set to -1 and there are 4 GPUs available, all of them can be used for the training of a single model across a Dask cluster. Ignored if GPUs are disabled or if there are no GPUs on the system. In multinode context, this refers to the per-node value. ``detailed_traces``\n~\n\n.. dropdown:: Enable Detailed Traces\n\t:open:\n\n\tSpecify whether to enable detailed tracing in Driverless AI trace when running an experiment." }, { "output": " The F0.5 score is the weighted harmonic mean of the precision and recall (given a threshold value). 
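In formula form (the standard F-beta definition with beta = 0.5, shown here for reference):\n\n.. math::\n\n F_{0.5} = \frac{1.25 \cdot \text{precision} \cdot \text{recall}}{0.25 \cdot \text{precision} + \text{recall}}\n\n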
Unlike the F1 score, which gives equal weight to precision and recall, the F0.5 score gives more weight to precision than to recall. More weight should be given to precision for cases where False Positives are considered worse than False Negatives. For example, if your use case is to predict which products you will run out of, you may consider False Positives worse than False Negatives. In this case, you want your predictions to be very precise and only capture the products that will definitely run out." }, { "output": " S3 Setup\n\n\nDriverless AI lets you explore S3 data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with S3. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``aws_access_key_id``: The S3 access key ID\n- ``aws_secret_access_key``: The S3 access key\n- ``aws_role_arn``: The Amazon Resource Name\n- ``aws_default_region``: The region to use when the aws_s3_endpoint_url option is not set." }, { "output": " - ``aws_s3_endpoint_url``: The endpoint URL that will be used to access S3. - ``aws_use_ec2_role_credentials``: If set to true, the S3 Connector will try to obtain credentials associated with the role attached to the EC2 instance. - ``s3_init_path``: The starting S3 path that will be displayed in the UI S3 browser. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Example 1: Enable S3 with No Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n\tThis example enables the S3 data connector and disables authentication." }, { "output": " This allows users to reference data stored in S3 directly using the name node address, for example: s3://name.node/datasets/iris.csv. .. code-block:: bash\n\t :substitutions:\n\n\t nvidia-docker run \\\n\t\t\tshm-size=256m \\\n\t\t\tadd-host name.node:172.16.2.186 \\\n\t\t\t-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,s3\" \\\n\t\t\t-p 12345:12345 \\\n\t\t\tinit -it rm \\\n\t\t\t-v /tmp/dtmp/:/tmp \\\n\t\t\t-v /tmp/dlog/:/log \\\n\t\t\t-v /tmp/dlicense/:/license \\\n\t\t\t-v /tmp/ddata/:/data \\\n\t\t\t-u $(id -u):$(id -g) \\\n\t\t\th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n\tThis example shows how to configure S3 options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, s3\"``\n\n\t2. Mount the config.toml file into the Docker container. .. 
code-block:: bash\n\t \t :substitutions:\n\n\t\t nvidia-docker run \\\n\t\t \tpid=host \\\n\t\t \tinit \\\n\t\t \trm \\\n\t\t \tshm-size=256m \\\n\t\t \tadd-host name.node:172.16.2.186 \\\n\t\t \t-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n\t\t \t-p 12345:12345 \\\n\t\t \t-v /local/path/to/config.toml:/path/in/docker/config.toml \\\n\t\t \t-v /etc/passwd:/etc/passwd:ro \\\n\t\t \t-v /etc/group:/etc/group:ro \\\n\t\t \t-v /tmp/dtmp/:/tmp \\\n\t\t \t-v /tmp/dlog/:/log \\\n\t\t \t-v /tmp/dlicense/:/license \\\n\t\t \t-v /tmp/ddata/:/data \\\n\t\t \t-u $(id -u):$(id -g) \\\n\t\t \th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n\tThis example enables the S3 data connector and disables authentication." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n\t ::\n\n\t # DEB and RPM\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n\t # TAR SH\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n\t2. Specify the following configuration options in the config.toml file. ::\n\n\t\t# File System Support\n\t\t# upload : standard upload feature\n\t\t# file : local file system/server file system\n\t\t# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n\t\t# dtap : Blue Data Tap file system, remember to configure the DTap section below\n\t\t# s3 : Amazon S3, optionally configure secret and access key below\n\t\t# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n\t\t# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n\t\t# minio : Minio Cloud Storage, remember to configure secret and access key below\n\t\t# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n\t\t# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n\t\t# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n\t\t# jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n\t\t# recipe_url: load custom recipe from URL\n\t\t# recipe_file: load custom recipe from local file system\n\t\tenabled_file_systems = \"file, s3\"\n\n\t3. Save the changes when you are done, then stop/restart Driverless AI. Example 2: Enable S3 with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n\tThis example enables the S3 data connector with authentication by passing an S3 access key ID and an access key. It also configures Docker DNS by passing the name and IP of the S3 name node." }, { "output": " .. code-block:: bash\n\t :substitutions:\n\n\t nvidia-docker run \\\n\t\t\t\tshm-size=256m \\\n\t\t\t\tadd-host name.node:172.16.2.186 \\\n\t\t\t\t-e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,s3\" \\\n\t\t\t\t-e DRIVERLESS_AI_AWS_ACCESS_KEY_ID=\"\" \\\n\t\t\t\t-e DRIVERLESS_AI_AWS_SECRET_ACCESS_KEY=\"\" \\ \n\t\t\t\t-p 12345:12345 \\\n\t\t\t\tinit -it rm \\\n\t\t\t\t-v /tmp/dtmp/:/tmp \\\n\t\t\t\t-v /tmp/dlog/:/log \\\n\t\t\t\t-v /tmp/dlicense/:/license \\\n\t\t\t\t-v /tmp/ddata/:/data \\\n\t\t\t\t-u $(id -u):$(id -g) \\\n\t\t\t\th2oai/dai-ubi8-x86_64:|tag|\n\n .. 
group-tab:: Docker Image with the config.toml\n\n\tThis example shows how to configure S3 options with authentication in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, s3\"``\n\t - ``aws_access_key_id = \"\"``\n\t - ``aws_secret_access_key = \"\"``\n\n\t2. Mount the config.toml file into the Docker container. .. code-block:: bash\n\t \t:substitutions:\n\n\t\t nvidia-docker run \\\n\t\t \tpid=host \\\n\t\t \tinit \\\n\t\t \trm \\\n\t\t \tshm-size=256m \\\n\t\t \tadd-host name.node:172.16.2.186 \\\n\t\t \t-e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n\t\t \t-p 12345:12345 \\\n\t\t \t-v /local/path/to/config.toml:/path/in/docker/config.toml \\\n\t\t \t-v /etc/passwd:/etc/passwd:ro \\\n\t\t \t-v /etc/group:/etc/group:ro \\\n\t\t \t-v /tmp/dtmp/:/tmp \\\n\t\t \t-v /tmp/dlog/:/log \\\n\t\t \t-v /tmp/dlicense/:/license \\\n\t\t \t-v /tmp/ddata/:/data \\\n\t\t \t-u $(id -u):$(id -g) \\\n\t\t \th2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n\tThis example enables the S3 data connector with authentication by passing an S3 access key ID and an access key." }, { "output": " Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n\t ::\n\n\t # DEB and RPM\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n\t # TAR SH\n\t export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n\t2. Specify the following configuration options in the config.toml file. ::\n\n\t\t# File System Support\n\t\t# upload : standard upload feature\n\t\t# file : local file system/server file system\n\t\t# hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n\t\t# dtap : Blue Data Tap file system, remember to configure the DTap section below\n\t\t# s3 : Amazon S3, optionally configure secret and access key below\n\t\t# gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n\t\t# gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n\t\t# minio : Minio Cloud Storage, remember to configure secret and access key below\n\t\t# snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n\t\t# kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n\t\t# azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n\t\t# jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " .. _image-settings:\n\nImage Settings\n\n\n``enable_tensorflow_image``\n~\n.. dropdown:: Enable Image Transformer for Processing of Image Data\n\t:open:\n\n\tSpecify whether to use pretrained deep learning models for processing of image data as part of the feature engineering pipeline. When this is enabled, a column of Uniform Resource Identifiers (URIs) to images is converted to a numeric representation using ImageNet-pretrained deep learning models. This is enabled by default. .. _tensorflow_image_pretrained_models:\n\n``tensorflow_image_pretrained_models``\n\n\n.. dropdown:: Supported ImageNet Pretrained Architectures for Image Transformer\n\t:open:\n\n\tSpecify the supported `ImageNet `__ pretrained architectures for image transformer." 
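}, { "output": " For example, a config.toml sketch that pins the transformer to a single architecture (the architecture name is illustrative; choose from the list shown in the UI):\n\n ::\n\n tensorflow_image_pretrained_models = [\"xception\"]" 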
}, { "output": " If an internet connection is not available, non-default models must be downloaded from http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/dai_image_models_1_10.zip and extracted into ``tensorflow_image_pretrained_models_dir``. - Multiple transformers can be activated at the same time to allow the selection of multiple options. In this case, embeddings from the different architectures are concatenated together (in a single embedding). ``tensorflow_image_vectorization_output_dimension``\n~\n.. dropdown:: Dimensionality of Feature Space Created by Image Transformer\n\t:open:\n\n\tSpecify the dimensionality of the feature (embedding) space created by Image Transformer." }, { "output": " .. _image-model-fine-tune:\n\n``tensorflow_image_fine_tune``\n\n.. dropdown:: Enable Fine-Tuning of the Pretrained Models Used for the Image Transformer\n\t:open:\n\n\tSpecify whether to enable fine-tuning of the ImageNet pretrained models used for the Image Transformer. This is disabled by default. ``tensorflow_image_fine_tuning_num_epochs``\n~\n.. dropdown:: Number of Epochs for Fine-Tuning Used for the Image Transformer\n\t:open:\n\n\tSpecify the number of epochs for fine-tuning ImageNet pretrained models used for the Image Transformer." }, { "output": " ``tensorflow_image_augmentations``\n\n.. dropdown:: List of Augmentations for Fine-Tuning Used for the Image Transformer\n\t:open:\n\n\tSpecify the list of possible image augmentations to apply while fine-tuning the ImageNet pretrained models used for the Image Transformer. Select from the following:\n\n\t- Blur\n\t- CLAHE\n\t- Downscale\n\t- GaussNoise\n\t- GridDropout\n\t- HorizontalFlip (Default)\n\t- HueSaturationValue\n\t- ImageCompression\n\t- OpticalDistortion\n\t- RandomBrightnessContrast\n\t- RandomRotate90\n\t- ShiftScaleRotate\n\t- VerticalFlip\n\n\tNote: For more information on individual augmentations, see https://albumentations.ai/docs/." }, { "output": " By default, the batch size is set to -1 (selected automatically). Note: Larger architectures and batch sizes use more memory. ``image_download_timeout``\n\n.. dropdown:: Image Download Timeout in Seconds\n\t:open:\n\n\tWhen providing images through URLs, specify the maximum number of seconds to wait for an image to download. This value defaults to 60 sec. ``string_col_as_image_max_missing_fraction``\n\n.. dropdown:: Maximum Allowed Fraction of Missing Values for Image Column\n\t:open:\n\n\tSpecify the maximum allowed fraction of missing elements in a string column for it to be considered as a potential image path." }, { "output": " ``string_col_as_image_min_valid_types_fraction``\n\n.. dropdown:: Minimum Fraction of Images That Need to Be of Valid Types for Image Column to Be Used\n\t:open:\n\n\tSpecify the fraction of unique image URIs that need to have valid endings (as defined by ``string_col_as_image_valid_types``) for a string column to be considered as image data. This value defaults to 0.8. ``tensorflow_image_use_gpu``\n\n.. dropdown:: Enable GPU(s) for Faster Transformations With the Image Transformer\n\t:open:\n\n\tSpecify whether to use any available GPUs to transform images into embeddings with the Image Transformer." }, { "output": " Install on RHEL\n-\n\nThis section describes how to install the Driverless AI Docker image on RHEL. The installation steps vary depending on whether your system has GPUs or if it is CPU only. Environment\n~\n\n+-+-+-+\n| Operating System | GPUs? 
| Min Mem |\n+=+=+=+\n| RHEL with GPUs | Yes | 64 GB |\n+-+-+-+\n| RHEL with CPUs | No | 64 GB |\n+-+-+-+\n\n.. _install-on-rhel-with-gpus:\n\nInstall on RHEL with GPUs\n~\n\nNote: Refer to the following links for more information about using RHEL with GPUs." }, { "output": " This is necessary in order to prevent a mismatch between the NVIDIA driver and the kernel, which can lead to the GPUs failures. - https://access.redhat.com/solutions/2372971\n - https://www.rootusers.com/how-to-disable-specific-package-updates-in-rhel-centos/\n\nWatch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same." }, { "output": " Open a Terminal and ssh to the machine that will run Driverless AI. Once you are logged in, perform the following steps. 1. Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 2. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/. Alternatively, you can run on Docker CE. .. code-block:: bash\n\n sudo yum install -y yum-utils\n sudo yum-config-manager add-repo https://download.docker.com/linux/centos/docker-ce.repo\n sudo yum makecache fast\n sudo yum -y install docker-ce\n sudo systemctl start docker\n\n3." }, { "output": " More information is available at https://github.com/NVIDIA/nvidia-docker/blob/master/README.md. .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(. /etc/os-release;echo $ID$VERSION_ID)\n curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \\\n sudo tee /etc/apt/sources.list.d/nvidia-docker.list\n sudo apt-get update\n\n # Install nvidia-docker2 and reload the Docker daemon configuration\n sudo apt-get install -y nvidia-docker2\n\n Note: If you would like the nvidia-docker service to automatically start when the server is rebooted then run the following command." }, { "output": " .. code-block:: bash\n\n sudo systemctl enable nvidia-docker\n\n Alternatively, if you have installed Docker CE above you can install nvidia-docker with:\n\n .. code-block:: bash\n\n curl -s -L https://nvidia.github.io/nvidia-docker/centos7/x86_64/nvidia-docker.repo | \\\n sudo tee /etc/yum.repos.d/nvidia-docker.repo\n sudo yum install nvidia-docker2\n\n4. Verify that the NVIDIA driver is up and running. If the driver is not up and running, log on to http://www.nvidia.com/Download/index.aspx?lang=en-us to get the latest NVIDIA Tesla V/P/K series driver." }, { "output": " Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n \n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n6. Change directories to the new folder, then load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the new directory\n cd |VERSION-dir|\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n7." }, { "output": " Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html. .. include:: enable-persistence.rst\n\n8. Set up the data, log, and license directories on the host machine (within the new directory):\n\n .. 
code-block:: bash\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n9. At this point, you can copy data into the data directory on the host machine." }, { "output": " 10. Run ``docker images`` to find the image tag. 11. Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command. Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini will print a (harmless) warning message. For GPU users, as GPU needs ``pid=host`` for nvml, which makes tini not use pid=1, so it will show the warning message (still harmless)." }, { "output": " But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command. Note: Use ``docker version`` to check which version of Docker you are using. .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n12." }, { "output": " .. _install-on-rhel-cpus-only:\n\nInstall on RHEL with CPUs\n~\n\nThis section describes how to install and start the Driverless AI Docker image on RHEL. Note that this uses ``docker`` and not ``nvidia-docker``. Watch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same. .. note::\n\tAs of this writing, Driverless AI has been tested on RHEL versions 7.4, 8.3, and 8.4." }, { "output": " Once you are logged in, perform the following steps. 1. Install and start Docker EE on RHEL (if not already installed). Follow the instructions on https://docs.docker.com/engine/installation/linux/docker-ee/rhel/. Alternatively, you can run on Docker CE. .. code-block:: bash\n\n sudo yum install -y yum-utils\n sudo yum-config-manager add-repo https://download.docker.com/linux/centos/docker-ce.repo\n sudo yum makecache fast\n sudo yum -y install docker-ce\n sudo systemctl start docker\n\n2." }, { "output": " 3. Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n4. Load the Driverless AI Docker image inside the new directory:\n\n .. 
code-block:: bash\n :substitutions:\n\n # Load the Driverless AI Docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5. Set up the data, log, license, and tmp directories (within the new directory):\n\n .. code-block:: bash\n :substitutions:\n\n # cd into the directory associated with your version of Driverless AI\n cd |VERSION-dir|\n\n # Set up the data, log, license, and tmp directories on the host machine\n mkdir data\n mkdir log\n mkdir license\n mkdir tmp\n\n6." }, { "output": " The data will be visible inside the Docker container at //data. 7. Run ``docker images`` to find the image tag. 8. Start the Driverless AI Docker image. Note that GPU support will not be available. Note that from version 1.10 DAI docker image runs with internal ``tini`` that is equivalent to using ``init`` from docker, if both are enabled in the launch command, tini will print a (harmless) warning message. We recommend ``shm-size=256m`` in docker launch command. But if user plans to build :ref:`image auto model ` extensively, then ``shm-size=2g`` is recommended for Driverless AI docker command." }, { "output": " HDFS Setup\n\n\nDriverless AI lets you explore HDFS data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with HDFS. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using. Description of Configuration Attributes\n~\n\n- ``hdfs_config_path`` (Required): The location the HDFS config folder path." }, { "output": " - ``hdfs_auth_type`` (Required): Specifies the HDFS authentication. Available values are:\n\n - ``principal``: Authenticate with HDFS with a principal user. - ``keytab``: Authenticate with a keytab (recommended). If running DAI as a service, then the Kerberos keytab needs to be owned by the DAI user. - ``keytabimpersonation``: Login with impersonation using a keytab. - ``noauth``: No authentication needed. - ``key_tab_path``: The path of the principal key tab file. This is required when ``hdfs_auth_type='principal'``." }, { "output": " This is required when ``hdfs_auth_type='keytab'``. - ``hdfs_app_jvm_args``: JVM args for HDFS distributions. Separate each argument with spaces. - ``-Djava.security.krb5.conf``\n - ``-Dsun.security.krb5.debug``\n - ``-Dlog4j.configuration``\n\n- ``hdfs_app_classpath``: The HDFS classpath. - ``hdfs_app_supported_schemes``: The list of DFS schemas that is used to check whether a valid input to the connector has been established. For example:\n\n ::\n\n hdfs_app_supported_schemes = ['hdfs://', 'maprfs://', 'custom://']\n\n The following are the default values for this option." }, { "output": " - ``hdfs://``\n - ``maprfs://``\n - ``swift://``\n\n- ``hdfs_max_files_listed``: Specifies the maximum number of files that are viewable in the connector UI. Defaults to 100 files. To view more files, increase the default value. - ``hdfs_init_path``: Specifies the starting HDFS path displayed in the UI of the HDFS browser. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Example 1: Enable HDFS with No Authentication\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the HDFS data connector and disables HDFS authentication." 
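}, { "output": " For reference, the same setup expressed as a config.toml sketch (using the attributes described above):\n\n ::\n\n enabled_file_systems = \"file, hdfs\"\n hdfs_auth_type = \"noauth\"" 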
}, { "output": " This lets you reference data stored in HDFS directly using name node address, for example: ``hdfs://name.node/datasets/iris.csv``. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs\" \\\n -e DRIVERLESS_AI_HDFS_AUTH_TYPE='noauth' \\\n -e DRIVERLESS_AI_PROCSY_PORT=8080 \\\n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure HDFS options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed. - ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the HDFS data connector and disables HDFS authentication in the config.toml file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. Note that the procsy port, which defaults to 12347, also has to be changed. ::\n\n # IP address and port of procsy process. procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." 
}, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI. Example 2: Enable HDFS with Keytab-Based Authentication\n~\n\nNotes: \n\n- If using Kerberos Authentication, then the time on the Driverless AI server must be in sync with Kerberos server. If the time difference between clients and DCs are 5 minutes or higher, there will be Kerberos failures." }, { "output": " .. tabs::\n .. group-tab:: Docker Image Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below. - Configures the environment variable ``DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER`` to reference a user for whom the keytab was created (usually in the form of user@realm). .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs\" \\\n -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytab' \\\n -e DRIVERLESS_AI_KEY_TAB_PATH='tmp/<>' \\\n -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<>' \\\n -e DRIVERLESS_AI_PROCSY_PORT=8080 \\ \n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed. - ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n - ``hdfs_auth_type = \"keytab\"``\n - ``key_tab_path = \"/tmp/\"``\n - ``hdfs_app_principal_user = \"\"``\n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n \n # IP address and port of procsy process. 
procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n # HDFS connector\n # Auth type can be Principal/keytab/keytabPrincipal\n # Specify HDFS Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with HDFS with a principal user\n # keytab : Authenticate with a Key tab (recommended)\n # keytabimpersonation : Login with impersonation using a keytab\n hdfs_auth_type = \"keytab\"\n\n # Path of the principal key tab file\n key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n hdfs_app_principal_user = \"\"\n\n 3." }, { "output": " Example 3: Enable HDFS with Keytab-Based Impersonation\n\n\nNotes: \n\n- If using Kerberos, be sure that the Driverless AI time is synched with the Kerberos server. - If running Driverless AI as a service, then the Kerberos keytab needs to be owned by the Driverless AI user. - Logins are case sensitive when keytab-based impersonation is configured. .. tabs::\n .. group-tab:: Docker Image Installs\n\n The example:\n\n - Sets the authentication type to ``keytabimpersonation``. - Places keytabs in the ``/tmp/dtmp`` folder on your machine and provides the file path as described below." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs\" \\\n -e DRIVERLESS_AI_HDFS_AUTH_TYPE='keytabimpersonation' \\\n -e DRIVERLESS_AI_KEY_TAB_PATH='/tmp/<>' \\\n -e DRIVERLESS_AI_HDFS_APP_PRINCIPAL_USER='<>' \\\n -e DRIVERLESS_AI_PROCSY_PORT=8080 \\ \n -p 12345:12345 \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example:\n\n - Sets the authentication type to ``keytabimpersonation``." }, { "output": " - Configures the ``hdfs_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm). 1. Configure the Driverless AI config.toml file. Set the following configuration options. Note that the procsy port, which defaults to 12347, also has to be changed. 
- ``enabled_file_systems = \"file, upload, hdfs\"``\n - ``procsy_ip = \"127.0.0.1\"``\n - ``procsy_port = 8080``\n - ``hdfs_auth_type = \"keytabimpersonation\"``\n - ``key_tab_path = \"/tmp/\"``\n - ``hdfs_app_principal_user = \"\"``\n\n 2." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example:\n\n - Sets the authentication type to ``keytabimpersonation``." }, { "output": " - Configures the ``hdfs_app_principal_user`` variable, which references a user for whom the keytab was created (usually in the form of user@realm). 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." }, { "output": " procsy_ip = \"127.0.0.1\"\n procsy_port = 8080\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, hdfs\"\n\n # HDFS connector\n # Auth type can be Principal/keytab/keytabPrincipal\n # Specify HDFS Auth Type, allowed options are:\n # noauth : No authentication needed\n # principal : Authenticate with HDFS with a principal user\n # keytab : Authenticate with a Key tab (recommended)\n # keytabimpersonation : Login with impersonation using a keytab\n hdfs_auth_type = \"keytabimpersonation\"\n\n # Path of the principal key tab file\n key_tab_path = \"/tmp/\"\n\n # Kerberos app principal user (recommended)\n hdfs_app_principal_user = \"\"\n\n 3." }, { "output": " Specifying a Hadoop Platform\n\n\nThe following example shows how to build an H2O-3 Hadoop image and run Driverless AI. This example uses CDH 6.0. Change the ``H2O_TARGET`` to specify a different platform. 1. Clone and then build H2O-3 for CDH 6.0. .. 
code-block:: bash\n\n git clone https://github.com/h2oai/h2o-3.git\n cd h2o-3\n ./gradlew clean build -x test\n export H2O_TARGET=cdh6.0\n export BUILD_HADOOP=true\n ./gradlew clean build -x test\n\n2. Start H2O. .. code-block:: bash\n\n docker run -it rm \\\n -v `pwd`:`pwd` \\\n -w `pwd` \\\n entrypoint bash \\\n network=host \\\n -p 8020:8020 \\\n docker.h2o.ai/cdh-6-w-hive \\\n -c 'sudo -E startup.sh && \\\n source /envs/h2o_env_python3.8/bin/activate && \\\n hadoop jar h2o-hadoop-3/h2o-cdh6.0-assembly/build/libs/h2odriver.jar -libjars \"$(cat /opt/hive-jars/hive-libjars)\" -n 1 -mapperXmx 2g -baseport 54445 -notify h2o_one_node -ea -disown && \\\n export CLOUD_IP=localhost && \\\n export CLOUD_PORT=54445 && \\\n make -f scripts/jenkins/Makefile.jenkins test-hadoop-smoke; \\\n bash'\n\n3." }, { "output": " .. _running-docker-on-gce:\n\nInstall and Run in a Docker Container on Google Compute Engine\n\n\nThis section describes how to install and start Driverless AI from scratch using a Docker container in a Google Compute environment. This installation assumes that you already have a Google Cloud Platform account. If you don't have an account, go to https://console.cloud.google.com/getting-started to create one. In addition, refer to Google's `Machine Types documentation `__ for information on Google Compute machine types." }, { "output": " Note that some of the images in this video may change between releases, but the installation steps remain the same. Before You Begin\n\n\nIf you are trying GCP for the first time and have just created an account, check your Google Compute Engine (GCE) resource quota limits. By default, GCP allocates a maximum of 8 CPUs and no GPUs. You can change these settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/compute/quotas for more information, including information on how to check your quota and request additional quota." }, { "output": " In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/. 2. In the left navigation panel, select Compute Engine > VM Instances. .. image:: ../images/gce_newvm_instance.png\n :align: center\n :height: 390\n :width: 400\n\n3. Click Create Instance. .. image:: ../images/gce_create_instance.png\n :align: center\n\n4. Specify the following at a minimum:\n\n - A unique name for this instance. - The desired `zone `__." }, { "output": " Refer to the following for information on how to add GPUs: https://cloud.google.com/compute/docs/gpus/. - A supported OS, for example Ubuntu 16.04. Be sure to also increase the disk size of the OS image to be 64 GB. Click Create at the bottom of the form when you are done. This creates the new VM instance. .. image:: ../images/gce_instance_settings.png\n :align: center\n :height: 446\n :width: 380\n\n5. Create a Firewall rule for Driverless AI. On the Google Cloud Platform left navigation panel, select VPC network > Firewall rules." }, { "output": " - Change the Targets dropdown to All instances in the network. - Specify the Source IP ranges to be ``0.0.0.0/0``. - Under Protocols and Ports, select Specified protocols and ports and enter the following: ``tcp:12345``. Click Create at the bottom of the form when you are done. .. image:: ../images/gce_create_firewall_rule.png\n :align: center\n :height: 452\n :width: 477\n\n6. On the VM Instances page, SSH to the new VM Instance by selecting Open in Browser Window from the SSH dropdown. .. 
image:: ../images/gce_ssh_in_browser.png\n :align: center\n\n7." }, { "output": " Open an editor in the VM instance (for example, vi). Copy one of the scripts below (depending on whether you are running GPUs or CPUs). Save the script as install.sh. .. code-block:: bash\n\n # SCRIPT FOR GPUs ONLY\n apt-get -y update \n apt-get -y no-install-recommends install \\\n curl \\\n apt-utils \\\n python-software-properties \\\n software-properties-common\n\n add-apt-repository -y ppa:graphics-drivers/ppa\n add-apt-repository -y \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add - \n\n apt-get update \n apt-get install -y \\ \n nvidia-384 \\\n nvidia-modprobe \\\n docker-ce\n\n curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \\\n sudo apt-key add -\n distribution=$(." }, { "output": " Type the following commands to run the install script. .. code-block:: bash\n\n chmod +x install.sh\n sudo ./install.sh\n\n9. In your user folder, create the following directories as your user. .. code-block:: bash\n\n mkdir ~/tmp\n mkdir ~/log\n mkdir ~/data\n mkdir ~/scripts\n mkdir ~/license\n mkdir ~/demo\n mkdir -p ~/jupyter/notebooks\n\n10. Add your Google Compute user name to the Docker container. .. code-block:: bash\n\n sudo usermod -aG docker \n\n\n11. Reboot the system to enable NVIDIA drivers." }, { "output": " Retrieve the Driverless AI Docker image from https://www.h2o.ai/download/. 13. Load the Driverless AI Docker image. The following example shows how to load Driverless AI. Replace VERSION with your image. .. code-block:: bash\n :substitutions:\n\n sudo docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n14. If you are running CPUs, you can skip this step. Otherwise, you must enable persistence of the GPU. Note that this needs to be run once every reboot. Refer to the following for more information: http://docs.nvidia.com/deploy/driver-persistence/index.html." }, { "output": " Start the Driverless AI Docker image and replace TAG below with the image tag. Depending on your install version, use the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command. Refer to :ref:`Data Connectors` for information on how to add the GCS and GBQ data connectors to your Driverless AI instance. Note: Use ``docker version`` to check which version of Docker you are using. .. tabs::\n\n .. tab:: >= Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n docker run runtime=nvidia \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. tab:: < Docker 19.03\n\n .. code-block:: bash\n :substitutions:\n\n # Start the Driverless AI Docker image\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -u `id -u`:`id -g` \\\n -p 12345:12345 \\\n -v `pwd`/data:/data \\\n -v `pwd`/log:/log \\\n -v `pwd`/license:/license \\\n -v `pwd`/tmp:/tmp \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n Driverless AI will begin running::\n\n \n Welcome to H2O.ai's Driverless AI\n -\n\n - Put data in the volume mounted at /data\n - Logs are written to the volume mounted at /log/20180606-044258\n - Connect to Driverless AI on port 12345 inside the container\n - Connect to Jupyter notebook on port 8888 inside the container\n\n16." 
}, { "output": " Azure Blob Store Setup\n \n\nDriverless AI lets you explore Azure Blob Store data sources from within the Driverless AI application. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using. Supported Data Sources Using the Azure Blob Store Connector\n~\n\nThe following data sources can be used with the Azure Blob Store connector." }, { "output": " - :ref:`Azure Data Lake Gen 1 (HDFS connector required)`\n- :ref:`Azure Data Lake Gen 2 (HDFS connector optional)`\n\n\nDescription of Configuration Attributes\n~\n\nThe following configuration attributes are specific to enabling Azure Blob Storage. - ``azure_blob_account_name``: The Microsoft Azure Storage account name. This should be the dns prefix created when the account was created (for example, \"mystorage\"). - ``azure_blob_account_key``: Specify the account key that maps to your account name." }, { "output": " With this option, you can include an override for a host, port, and/or account name. For example, \n\n .. code:: bash\n\n azure_connection_string = \"DefaultEndpointsProtocol=http;AccountName=;AccountKey=;BlobEndpoint=http://:/;\"\n\n- ``azure_blob_init_path``: Specifies the starting Azure Blob store path displayed in the UI of the Azure Blob store browser. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly." }, { "output": " - ``hdfs_config_path``: The location the HDFS config folder path. This folder can contain multiple config files. - ``hdfs_app_classpath``: The HDFS classpath. - ``hdfs_app_supported_schemes``: Supported schemas list is used as an initial check to ensure valid input to connector. .. _example1:\n\nExample 1: Enabling the Azure Blob Store Data Connector\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the Azure Blob Store data connector by specifying environment variables when starting the Driverless AI Docker image." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,azrbs\" \\\n -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_NAME=\"mystorage\" \\\n -e DRIVERLESS_AI_AZURE_BLOB_ACCOUNT_KEY=\"\" \\\n -p 12345:12345 \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure Azure Blob Store options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, azrbs\"``\n - ``azure_blob_account_name = \"mystorage\"``\n - ``azure_blob_account_key = \"\"``\n\n 2. Mount the config.toml file into the Docker container. .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example shows how to enable the Azure Blob Store data connector in the config.toml file when starting Driverless AI in native installs." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, azrbs\"\n\n # Azure Blob Store Connector credentials\n azure_blob_account_name = \"mystorage\"\n azure_blob_account_key = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI. .. _example2:\n\nExample 2: Mount Azure File Shares to the Local File System\n~\n\nSupported Data Sources Using the Local File System\n\n\n- Azure Files (File Storage) \n\nMounting Azure File Shares\n\n\nAzure file shares can be mounted into the Local File system of Driverless AI." }, { "output": " .. _example3:\n\nExample 3: Enable HDFS Connector to Connect to Azure Data Lake Gen 1\n~\n\nThis example enables the HDFS Connector to connect to Azure Data Lake Gen1. This lets users reference data stored on your Azure Data Lake using the adl uri, for example: ``adl://myadl.azuredatalakestore.net``. .. tabs::\n .. group-tab:: Docker Image with the config.toml\n\n 1. Create an Azure AD web application for service-to-service authentication: https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n\n 2." }, { "output": " Take note of the Hadoop Classpath and add the ``azure-datalake-store.jar`` file. 
This file can found on any Hadoop version in: ``$HADOOP_HOME/share/hadoop/tools/lib/*``. .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n\n 4. Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['adl://']\"\n \n 5." }, { "output": " .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory\n\n 2. Add the information from your web application to the hadoop ``core-site.xml`` configuration file:\n\n .. code:: bash\n\n \n \n fs.adl.oauth2.access.token.provider.type\n ClientCredential\n \n \n fs.adl.oauth2.refresh.url\n Token endpoint created in step 1.\n \n \n fs.adl.oauth2.client.id\n Client ID created in step 1\n \n \n fs.adl.oauth2.credential\n Client Secret created in step 1\n \n \n fs.defaultFS\n ADL URIt\n \n \n\n 3." }, { "output": " This file can found on any hadoop version in: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n\n 4. Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['adl://']\"\n \n 5." }, { "output": " .. _example4:\n\nExample 4: Enable HDFS Connector to Connect to Azure Data Lake Gen 2\n\n\nThis example enables the HDFS Connector to connect to Azure Data Lake Gen2. This lets users reference data stored on your Azure Data Lake using the Azure Blob File System Driver, for example: ``abfs[s]://file_system@account_name.dfs.core.windows.net///``. .. tabs::\n .. group-tab:: Docker Image with the config.toml\n\n 1. Create an Azure Service Principal: https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal\n\n 2." }, { "output": " Add the information from your web application to the Hadoop ``core-site.xml`` configuration file:\n\n .. code:: bash\n\n \n \n fs.azure.account.auth.type\n OAuth\n \n \n fs.azure.account.oauth.provider.type\n org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\n \n \n fs.azure.account.oauth2.client.endpoint\n Token endpoint created in step 1.\n \n \n fs.azure.account.oauth2.client.id\n Client ID created in step 1\n \n \n fs.azure.account.oauth2.client.secret\n Client Secret created in step 1\n \n \n\n 4." }, { "output": " These files can found on any Hadoop version 3.2 or higher at: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. 
code:: bash \n\n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n \n Note: ABFS is only supported for Hadoop version 3.2 or higher. 5. Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n\n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['abfs://']\"\n \n 6." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n 1." }, { "output": " https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal\n\n 2. Grant permissions to the Service Principal created on step 1 to access blobs: https://docs.microsoft.com/en-us/azure/storage/common/storage-auth-aad\n\n 3. Add the information from your web application to the hadoop ``core-site.xml`` configuration file:\n\n .. code:: bash\n\n \n \n fs.azure.account.auth.type\n OAuth\n \n \n fs.azure.account.oauth.provider.type\n org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider\n \n \n fs.azure.account.oauth2.client.endpoint\n Token endpoint created in step 1.\n \n \n fs.azure.account.oauth2.client.id\n Client ID created in step 1\n \n \n fs.azure.account.oauth2.client.secret\n Client Secret created in step 1\n \n \n\n 4." }, { "output": " These files can found on any hadoop version 3.2 or higher at: ``$HADOOP_HOME/share/hadoop/tools/lib/*``\n\n .. code:: bash \n \n echo \"$HADOOP_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/*\"\n \n Note: ABFS is only supported for hadoop version 3.2 or higher \n\n 5. Configure the Driverless AI config.toml file. Set the following configuration options: \n\n .. code:: bash\n \n enabled_file_systems = \"upload, file, hdfs, azrbs, recipe_file, recipe_url\"\n hdfs_config_path = \"/path/to/hadoop/conf\"\n hdfs_app_classpath = \"/hadoop/classpath/\"\n hdfs_app_supported_schemes = \"['abfs://']\"\n \n 6." }, { "output": " Export MOJO artifact to Azure Blob Storage\n\n\nIn order to export the MOJO artifact to Azure Blob Storage, you must enable support for the shared access signatures (SAS) token. You can enable support for the SAS token by setting the following variables in the ``config.toml`` file:\n\n\n1. ``enable_artifacts_upload=true``\n2. ``artifacts_store=\"azure\"``\n3. ``artifacts_azure_sas_token=\"token\"``\n\nFor instructions on exporting artifacts, see :ref:`export_artifacts`. FAQ\n\n\nCan I connect to my storage account using Private Endpoints?" }, { "output": " .. _recipes-settings:\n\nRecipes Settings\n\n\n.. _included_transformers:\n\n``included_transformers``\n\n\n.. dropdown:: Include Specific Transformers\n\t:open:\n\n\tSelect the :ref:`transformer(s) ` that you want to use in the experiment. Use the Check All/Uncheck All button to quickly add or remove all transfomers at once. 
Note: If you uncheck all transformers so that none is selected, Driverless AI will ignore this and will use the default list of transformers for that experiment. This list of transformers will vary for each experiment." }, { "output": " .. _included_models:\n\n``included_models``\n~\n\n.. dropdown:: Include Specific Models\n\t:open:\n\n\tSpecify the types of models that you want Driverless AI to build in the experiment. This list includes natively supported algorithms and models added with custom recipes. Note: The ImbalancedLightGBM and ImbalancedXGBoostGBM models are closely tied with the :ref:`sampling_method_for_imbalanced` option. Specifically:\n\n\t - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are ENABLED and the :ref:`sampling_method_for_imbalanced` is ENABLED (set to a value other than off), then Driverless AI will check your target imbalance fraction." }, { "output": " - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are DISABLED and the :ref:`sampling_method_for_imbalanced` option is ENABLED, then no special sampling technique will be performed. - If the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are ENABLED and the :ref:`sampling_method_for_imbalanced` is DISABLED, sampling will not be used, and these imbalanced models will be disabled. ``included_scorers``\n\n\n.. dropdown:: Include Specific Scorers\n\t:open:\n\n\tSpecify the scorer(s) that you want Driverless AI to include when running the experiment." }, { "output": " Preprocessing transformers can take any original features and output arbitrary features that are used by the normal layer of transformers. Notes:\n\n\t- Preprocessing transformers and all other layers of transformers are part of the Python and (if applicable) MOJO scoring packages. - Any :ref:`custom transformer recipe ` or native DAI transformer can be used as a preprocessing transformer. For example, a preprocessing transformer can perform interactions, string concatenations, or date extractions as a preprocessing step before the next layer of Date and DateTime transformations are performed." }, { "output": " However, one can use a run-time data recipe to (e.g.) convert a float date-time into string date-time, and this will be used by Driverless AIs Date and DateTime transformers as well as auto-detection of time series. 2) in order to do a time series experiment with the GUI/client auto-selecting groups, periods, etc. the dataset\n\t must have time column and groups prepared ahead of experiment by user or via a one-time :ref:`data recipe `. The equivalent config.toml parameter is ``included_pretransformers``." }, { "output": " This value defaults to 1. The equivalent config.toml parameter is ``num_pipeline_layers``. Note: This does not include the preprocessing layer specified by the :ref:`included_pretransformers` expert setting. .. _included_datas:\n\n``included_datas``\n\n\n.. dropdown:: Include Specific Data Recipes During Experiment\n\t:open:\n\n\tSpecify whether to include specific data recipes during the experiment. Avoids need for separate data preparation step, builds data preparation within experiment and within python scoring package." }, { "output": " The equivalent config.toml parameter is ``included_datas``. .. _included_individuals:\n\n``included_individuals``\n\n\n.. dropdown:: Include Specific Individuals\n\t:open:\n\n\tIn Driverless AI, every completed experiment automatically generates Python code for the experiment that corresponds to the individual(s) used to build the final model. 
You can edit this auto-generated Python code offline and upload it as a recipe, or edit and save it using the built-in :ref:`custom recipe management editor `." }, { "output": " This expert setting lets you do one of the following:\n\n\t- Leave this field empty to have all individuals be freshly generated and treated by DAI's AutoML as a container of model and transformer choices. - Select recipe display names of custom individuals through the UI. If the number of included custom individuals is less than DAI needs, then the remaining individuals are freshly generated. The equivalent config.toml parameter is ``included_individuals``. For more information, see :ref:`individual_recipe`." }, { "output": " Select from the following:\n\n\t- Auto (Default): Use this option to sync the threshold scorer with the scorer used for the experiment. If this is not possible, F1 is used. - F05 More weight on precision, less weight on recall. - F1: Equal weight on precision and recall. - F2: Less weight on precision, more weight on recall. - MCC: Use this option when all classes are equally important. ``prob_add_genes``\n\n\n.. dropdown:: Probability to Add Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to add genes or instances of transformers with specific attributes." }, { "output": " This value defaults to 0.5. ``prob_addbest_genes``\n\n\n.. dropdown:: Probability to Add Best Shared Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to add genes or instances of transformers with specific attributes that have shown to be beneficial to other individuals within the population. This value defaults to 0.5. ``prob_prune_genes``\n\n\n.. dropdown:: Probability to Prune Transformers\n\t:open:\n\n\tSpecify the unnormalized probability to prune genes or instances of transformers with specific attributes." }, { "output": " ``prob_perturb_xgb``\n\n\n.. dropdown:: Probability to Mutate Model Parameters\n\t:open:\n\n\tSpecify the unnormalized probability to change model hyper parameters. This value defaults to 0.25. ``prob_prune_by_features``\n\n\n.. dropdown:: Probability to Prune Weak Features\n\t:open:\n\n\tSpecify the unnormalized probability to prune features that have low variable importance instead of pruning entire instances of genes/transformers. This value defaults to 0.25. ``skip_transformer_failures``\n~\n\n.. dropdown:: Whether to Skip Failures of Transformers\n\t:open:\n\n\tSpecify whether to avoid failed transformers." }, { "output": " ``skip_model_failures``\n~\n\n.. dropdown:: Whether to Skip Failures of Models\n\t:open:\n\n\tSpecify whether to avoid failed models. Failures are logged according to the specified level for logging skipped failures. This is enabled by default. ``detailed_skip_failure_messages_level``\n\n\n.. dropdown:: Level to Log for Skipped Failures\n\t:open:\n\n\tSpecify one of the following levels for the verbosity of log failure messages for skipped transformers or models:\n\n\t- 0 = Log simple message\n\t- 1 = Log code line plus message (Default)\n\t- 2 = Log detailed stack traces\n\n``notify_failures``\n~\n\n.. dropdown:: Whether to Notify About Failures of Transformers or Models or Other Recipe Failures\n\t:open:\n\n\tSpecify whether to display notifications in the GUI about recipe failures." }, { "output": " .. _install-gcp-offering:\n\nInstall the Google Cloud Platform Offering\n\n\nThis section describes how to install and start Driverless AI in a Google Compute environment using the GCP Marketplace. 
This assumes that you already have a Google Cloud Platform account. If you don't have an account, go to https://console.cloud.google.com/getting-started to create one. Before You Begin\n\n\nIf you are trying GCP for the first time and have just created an account, check your Google Compute Engine (GCE) resource quota limits." }, { "output": " Our default recommendation for launching Driverless AI is 32 CPUs, 120 GB RAM, and 2 P100 NVIDIA GPUs. You can change these settings to match your quota limit, or you can request more resources from GCP. Refer to https://cloud.google.com/compute/quotas for more information, including information on how to check your quota and request additional quota. Installation Procedure\n\n\n1. In your browser, log in to the Google Compute Engine Console at https://console.cloud.google.com/. 2. In the left navigation panel, select Marketplace." }, { "output": " On the Marketplace page, search for Driverless and select the H2O.ai Driverless AI offering. The following page will display. .. image:: ../images/google_driverlessai_offering.png\n :align: center\n\n4. Click Launch on Compute Engine. (If necessary, refer to `Google Compute Instance Types `__ for information about machine and GPU types.) - Select a zone that has p100s or k80s (such as us-east1-)\n - Optionally change the number of cores and amount of memory." }, { "output": " - Specify a GPU type. (This defaults to a p100 GPU.) - Optionally change the number of GPUs. (Default is 2.) - Specify the boot disk type and size. - Optionally change the network name and subnetwork names. Be sure that whichever network you specify has port 12345 exposed. - Click Deploy when you are done. Driverless AI will begin deploying. Note that this can take several minutes. .. image:: ../images/google_deploy_compute_engine.png\n :align: center\n\n5. A summary page displays when the compute engine is successfully deployed." }, { "output": " Click on the Instance link to retrieve the external IP address for starting Driverless AI. .. image:: ../images/google_deploy_summary.png\n :align: center\n\n6. In your browser, go to https://[External_IP]:12345 to start Driverless AI. 7. Agree to the Terms and Conditions. 8. Log in to Driverless AI using your user name and password. 9. Optionally enable GCS and Big Query access. a. In order to enable GCS and Google BigQuery access, you must pass the running instance a service account json file configured with GCS and GBQ access." }, { "output": " Obtain a functioning service account json file from `GCP `__, rename it to \"service_account.json\", and copy it to the Ubuntu user on the running instance. .. code-block:: bash\n\n gcloud compute scp /path/to/service_account.json ubuntu@:service_account.json\n\n b. SSH into the machine running Driverless AI, and verify that the service_account.json file is in the /etc/dai/ folder. c. Restart the machine for the changes to take effect." }, { "output": " .. _time-series-settings:\n\nTime Series Settings\n\n\n.. _time-series-lag-based-recipe:\n\n``time_series_recipe``\n\n.. dropdown:: Time-Series Lag-Based Recipe\n\t:open:\n\n\tThis recipe specifies whether to include Time Series lag features when training a model with a provided (or autodetected) time column. This is enabled by default. Lag features are the primary automatically generated time series features and represent a variable's past values. At a given sample with time stamp :math:`t`, features at some time difference :math:`T` (lag) in the past are considered." 
}, { "output": " Lags can be created on any feature as well as on the target. Lagging variables are important in time series because knowing what happened in different time periods in the past can greatly facilitate predictions for the future. Note: Ensembling is disabled when the lag-based recipe with time columns is activated because it only supports a single final model. Ensembling is also disabled if a time column is selected or if time column is set to [Auto] on the experiment setup screen. More information about time series lag is available in the :ref:`time-series-use-case` section." }, { "output": " Note that it's possible to rerun another such diverse leaderboard on top of the best-performing model(s), which will effectively help you compose these expert settings. - 'sliding_window': If the forecast horizon is N periods, create a separate model for \"each of the (gap, horizon) pairs of (0,n), (n,n), (2*n,n), ..., (2*N-1, n) in units of time periods. The number of periods to predict per model n is controlled by the expert setting ``time_series_leaderboard_periods_per_model``, which defaults to 1." }, { "output": " ``time_series_leaderboard_periods_per_model``\n~\n.. dropdown:: Number of periods per model if time_series_leaderboard_mode is 'sliding_window'\n\t:open:\n\n\tSpecify the number of periods per model if ``time_series_leaderboard_mode`` is set to ``sliding_window``. Larger values lead to fewer models. .. _time_series_merge_splits:\n\n``time_series_merge_splits``\n\n.. dropdown:: Larger Validation Splits for Lag-Based Recipe\n\t:open:\n\n\tSpecify whether to create larger validation splits that are not bound to the length of the forecast horizon." }, { "output": " This is enabled by default. ``merge_splits_max_valid_ratio``\n\n.. dropdown:: Maximum Ratio of Training Data Samples Used for Validation\n\t:open:\n\n\tSpecify the maximum ratio of training data samples used for validation across splits when larger validation splits are created (see :ref:`time_series_merge_splits` setting). The default value (-1) will set the ratio automatically depending on the total amount of validation splits. .. _fixed_size_splits:\n\n``fixed_size_splits``\n~\n.. dropdown:: Fixed-Size Train Timespan Across Splits\n\t:open:\n\n\tSpecify whether to keep a fixed-size train timespan across time-based splits during internal validation." }, { "output": " This is disabled by default. ``time_series_validation_fold_split_datetime_boundaries``\n~\n.. dropdown:: Custom Validation Splits for Time-Series Experiments\n\t:open:\n\n\tSpecify date or datetime timestamps (in the same format as the time column) to use for custom training and validation splits. ``timeseries_split_suggestion_timeout``\n~\n.. dropdown:: Timeout in Seconds for Time-Series Properties Detection in UI\n\t:open:\n\n\tSpecify the timeout in seconds for time-series properties detection in Driverless AI's user interface." }, { "output": " .. _holiday-calendar:\n\n``holiday_features``\n\n.. dropdown:: Generate Holiday Features\n\t:open:\n\n\tFor time-series experiments, specify whether to generate holiday features for the experiment. This is enabled by default. ``holiday_countries``\n~\n.. dropdown:: Country code(s) for holiday features\n\t:open:\n\n\tSpecify country codes in the form of a list that is used to look up holidays. Note: This setting is for migration purposes only. ``override_lag_sizes``\n\n.. dropdown:: Time-Series Lags Override\n\t:open:\n\n\tSpecify the override lags to be used." 
}, { "output": " The following examples show the variety of different methods that can be used to specify override lags:\n\n\t- \"[0]\" disable lags\n\t- \"[7, 14, 21]\" specifies this exact list\n\t- \"21\" specifies every value from 1 to 21\n\t- \"21:3\" specifies every value from 1 to 21 in steps of 3\n\t- \"5-21\" specifies every value from 5 to 21\n\t- \"5-21:3\" specifies every value from 5 to 21 in steps of 3\n\n``override_ufapt_lag_sizes``\n\n.. dropdown:: Lags Override for Features That are not Known Ahead of Time\n\t:open:\n\n\tSpecify lags override for non-target features that are not known ahead of time." }, { "output": " - \"[0]\" disable lags\n\t- \"[7, 14, 21]\" specifies this exact list\n\t- \"21\" specifies every value from 1 to 21\n\t- \"21:3\" specifies every value from 1 to 21 in steps of 3\n\t- \"5-21\" specifies every value from 5 to 21\n\t- \"5-21:3\" specifies every value from 5 to 21 in steps of 3\n\n``min_lag_size``\n\n.. dropdown:: Smallest Considered Lag Size\n\t:open:\n\n\tSpecify a minimum considered lag size. This value defaults to -1. ``allow_time_column_as_feature``\n\n.. dropdown:: Enable Feature Engineering from Time Column\n\t:open:\n\n\tSpecify whether to enable feature engineering based on the selected time column, e.g." }, { "output": " This is enabled by default. ``allow_time_column_as_numeric_feature``\n\n.. dropdown:: Allow Integer Time Column as Numeric Feature\n\t:open:\n\n\tSpecify whether to enable feature engineering from an integer time column. Note that if you are using a time series recipe, using a time column (numeric time stamps) as an input feature can lead to a model that memorizes the actual timestamps instead of features that generalize to the future. This is disabled by default. ``datetime_funcs``\n\n.. dropdown:: Allowed Date and Date-Time Transformations\n\t:open:\n\n\tSpecify the date or date-time transformations to allow Driverless AI to use." }, { "output": " Note that ``get_num`` can lead to overfitting if used on IID problems and is disabled by default. .. _filter_datetime_funcs:\n\n``filter_datetime_funcs``\n~\n.. dropdown:: Auto Filtering of Date and Date-Time Transformations\n\t:open:\n\n\tWhether to automatically filter out date and date-time transformations that would lead to unseen values in the future. This is enabled by default. ``allow_tgc_as_features``\n~\n.. dropdown:: Consider Time Groups Columns as Standalone Features\n\t:open:\n\n\tSpecify whether to consider time groups columns as standalone features." }, { "output": " ``allowed_coltypes_for_tgc_as_features``\n\n.. dropdown:: Which TGC Feature Types to Consider as Standalone Features\n\t:open:\n\n\tSpecify whether to consider time groups columns (TGC) as standalone features. If \"Consider time groups columns as standalone features\" is enabled, then specify which TGC feature types to consider as standalone features. Available types are numeric, categorical, ohe_categorical, datetime, date, and text. All types are selected by default. Note that \"time_column\" is treated separately via the \"Enable Feature Engineering from Time Column\" option." }, { "output": " ``enable_time_unaware_transformers``\n\n.. dropdown:: Enable Time Unaware Transformers\n\t:open:\n\n\tSpecify whether various transformers (clustering, truncated SVD) are enabled, which otherwise would be disabled for time series experiments due to the potential to overfit by leaking across time within the fit of each fold. This is set to Auto by default. ``tgc_only_use_all_groups``\n~\n.. 
dropdown:: Always Group by All Time Groups Columns for Creating Lag Features\n\t:open:\n\n\tSpecify whether to group by all time groups columns for creating lag features, instead of sampling from them." }, { "output": " ``tgc_allow_target_encoding``\n~\n.. dropdown:: Allow Target Encoding of Time Groups Columns\n\t:open:\n\n\tSpecify whether it is allowed to target encode the time groups columns. This is disabled by default. Notes:\n\n\t- This setting is not affected by ``allow_tgc_as_features``. - Subgroups can be encoded by disabling ``tgc_only_use_all_groups``. ``time_series_holdout_preds``\n~\n.. dropdown:: Generate Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to create diagnostic holdout predictions on training data using moving windows." }, { "output": " This can be useful for MLI, but it will slow down the experiment considerably when enabled. Note that the model itself remains unchanged when this setting is enabled. ``time_series_validation_splits``\n~\n.. dropdown:: Number of Time-Based Splits for Internal Model Validation\n\t:open:\n\n\tSpecify a fixed number of time-based splits for internal model validation. Note that the actual number of allowed splits can be less than the specified value, and that the number of allowed splits is determined at the time an experiment is run." }, { "output": " ``time_series_splits_max_overlap``\n\n.. dropdown:: Maximum Overlap Between Two Time-Based Splits\n\t:open:\n\n\tSpecify the maximum overlap between two time-based splits. The amount of possible splits increases with higher values. This value defaults to 0.5. ``time_series_max_holdout_splits``\n\n.. dropdown:: Maximum Number of Splits Used for Creating Final Time-Series Model's Holdout Predictions\n\t:open:\n\n\tSpecify the maximum number of splits used for creating the final time-series Model's holdout predictions." }, { "output": " Use \t``time_series_validation_splits`` to control amount of time-based splits used for model validation. ``mli_ts_fast_approx``\n\n.. dropdown:: Whether to Speed up Calculation of Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to speed up time-series holdout predictions for back-testing on training data. This setting is used for MLI and calculating metrics. Note that predictions can be slightly less accurate when this setting is enabled. This is disabled by default. ``mli_ts_fast_approx_contribs``\n~\n.. dropdown:: Whether to Speed up Calculation of Shapley Values for Time-Series Holdout Predictions\n\t:open:\n\n\tSpecify whether to speed up Shapley values for time-series holdout predictions for back-testing on training data." }, { "output": " Note that predictions can be slightly less accurate when this setting is enabled. This is enabled by default. ``mli_ts_holdout_contribs``\n~\n.. dropdown:: Generate Shapley Values for Time-Series Holdout Predictions at the Time of Experiment\n\t:open:\n\n\tSpecify whether to enable the creation of Shapley values for holdout predictions on training data using moving windows at the time of the experiment. This can be useful for MLI, but it can slow down the experiment when enabled. If this setting is disabled, MLI will generate Shapley values on demand." }, { "output": " ``time_series_min_interpretability``\n\n.. dropdown:: Lower Limit on Interpretability Setting for Time-Series Experiments (Implicitly Enforced)\n\t:open:\n\n\tSpecify the lower limit on interpretability setting for time-series experiments. 
Values of 5 (default) or more can improve generalization by more aggressively dropping the least important features. To disable this setting, set this value to 1. ``lags_dropout``\n\n.. dropdown:: Dropout Mode for Lag Features\n\t:open:\n\n\tSpecify the dropout mode for lag features in order to achieve an equal n.a." }, { "output": " Independent mode performs a simple feature-wise dropout. Dependent mode takes the lag-size dependencies per sample/row into account. Dependent is enabled by default. ``prob_lag_non_targets``\n\n.. dropdown:: Probability to Create Non-Target Lag Features\n\t:open:\n\n\tLags can be created on any feature as well as on the target. Specify a probability value for creating non-target lag features. This value defaults to 0.1. .. _rolling-test-set-method:\n\n``rolling_test_method``\n~\n.. dropdown:: Method to Create Rolling Test Set Predictions\n\t:open:\n\n\tSpecify the method used to create rolling test set predictions." }, { "output": " TTA is enabled by default. Notes: \n\t\n\t- This setting only applies to the test set that is provided by the user during an experiment. - This setting only has an effect if the provided test set spans more periods than the forecast horizon and if the target values of the test set are known. ``fast_tta_internal``\n~\n.. dropdown:: Fast TTA for Internal Validation\n\t:open:\n\n\tSpecify whether the genetic algorithm applies Test Time Augmentation (TTA) in one pass instead of using rolling windows for validation splits longer than the forecast horizon." }, { "output": " ``prob_default_lags``\n~\n.. dropdown:: Probability for New Time-Series Transformers to Use Default Lags\n\t:open:\n\n\tSpecify the probability for new lags or the EWMA gene to use default lags. This is determined independently of the data by frequency, gap, and horizon. This value defaults to 0.2. ``prob_lagsinteraction``\n\n.. dropdown:: Probability of Exploring Interaction-Based Lag Transformers\n\t:open:\n\n\tSpecify the unnormalized probability of choosing other lag time-series transformers based on interactions." }, { "output": " ``prob_lagsaggregates``\n~\n.. dropdown:: Probability of Exploring Aggregation-Based Lag Transformers\n\t:open:\n\n\tSpecify the unnormalized probability of choosing other lag time-series transformers based on aggregations. This value defaults to 0.2. .. _centering-detrending:\n\n``ts_target_trafo``\n~\n.. dropdown:: Time Series Centering or Detrending Transformation\n\t:open:\n\n\tSpecify whether to use centering or detrending transformation for time series experiments. Select from the following:\n\n\t- None (Default)\n\t- Centering (Fast)\n\t- Centering (Robust)\n\t- Linear (Fast)\n\t- Linear (Robust)\n\t- Logistic\n\t- Epidemic (Uses the `SEIRD `_ model)\n\n\tThe fitted signal is removed from the target signal per individual time series once the free parameters of the selected model are fitted." }, { "output": " Predictions are made by adding the previously removed signal once the pipeline is fitted on the residuals. Notes:\n\n\t- MOJO support is currently disabled when this setting is enabled. - The Fast centering and linear detrending options use least squares fitting. - The Robust centering and linear detrending options use `random sample consensus `_ (RANSAC) to achieve higher tolerance w.r.t. outliers. - Please see (:ref:`Custom Bounds for SEIRD Epidemic Model Parameters `) for further details on how to customize the bounds of the free SEIRD parameters." 
}, { "output": " The target column must correspond to *I(t)*, which represents infection cases as a function of time. For each training split and time series group, the SEIRD model is fit to the target signal by optimizing a set of free parameters for each time series group. The model's value is then subtracted from the training response, and the residuals are passed to the feature engineering and modeling pipeline. For predictions, the SEIRD model's value is added to the residual predictions from the pipeline for each time series group." }, { "output": " The following is a list of valid parameters:\n\n\t- ``N_min``\n\t- ``N_max``\n\t- ``beta_min``\n\t- ``beta_max``\n\t- ``gamma_min``\n\t- ``gamma_max``\n\t- ``delta_min``\n\t- ``delta_max``\n\t- ``alpha_min``\n\t- ``alpha_max``\n\t- ``rho_min``\n\t- ``rho_max``\n\t- ``lockdown_min``\n\t- ``lockdown_max``\n\t- ``beta_decay_min``\n\t- ``beta_decay_max``\n\t- ``beta_decay_rate_min``\n\t- ``beta_decay_rate_max``\n\n\tYou can change any subset of parameters. For example:\n\n\t::\n\n\t ts_target_trafo_epidemic_params_dict=\"{'N_min': 1000, 'beta_max': 0.2}\"\n\n\tRefer to https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology and https://arxiv.org/abs/1411.3435 for more information on the SEIRD model." }, { "output": " To get the SEIR model, set ``alpha_min=alpha_max=rho_min=rho_max=beta_decay_rate_min=beta_decay_rate_max=0`` and ``lockdown_min=lockdown_max=-1``. ``ts_target_trafo_epidemic_target``\n~\n.. dropdown:: Which SEIRD Model Component the Target Column Corresponds To\n\t:open:\n\n\tSpecify a SEIRD model component for the target column to correspond to. Select from the following:\n\n\t- I (Default): Infected\n\t- R: Recovered\n\t- D: Deceased\n\n.. _ts-target-transformation:\n\n``ts_lag_target_trafo``\n~\n.. dropdown:: Time Series Lag-Based Target Transformation\n\t:open:\n\n\tSpecify whether to use either the difference between or ratio of the current target and a lagged target." }, { "output": " Google Cloud Storage Setup\n\n\nDriverless AI lets you explore Google Cloud Storage data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with Google Cloud Storage. This setup requires you to enable authentication. If you enable GCS or GBP connectors, those file systems will be available in the UI, but you will not be able to use those connectors without authentication. In order to enable the GCS data connector with authentication, you must:\n\n1." }, { "output": " 2. Mount the JSON file to the Docker instance. 3. Specify the path to the /json_auth_file.json in the gcs_path_to_service_account_json config option. Notes:\n\n- The account JSON includes authentications as provided by the system administrator. You can be provided a JSON file that contains both Google Cloud Storage and Google BigQuery authentications, just one or the other, or none at all. - Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image." }, { "output": " Description of Configuration Attributes\n'\n\n- ``gcs_path_to_service_account_json``: Specifies the path to the /json_auth_file.json file. - ``gcs_init_path``: Specifies the starting GCS path displayed in the UI of the GCS browser. Start GCS with Authentication\n~\n\n.. tabs::\n .. 
group-tab:: Docker Image Installs\n\n This example enables the GCS data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google Cloud Storage authentications." }, { "output": " 1. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n - ``enabled_file_systems = \"file, upload, gcs\"``\n - ``gcs_path_to_service_account_json = \"/service_account_json.json\"`` \n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Native Installs\n\n This example enables the GCS data connector with authentication by passing the JSON authentication file." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " .. _model-settings:\n\nModel Settings\n\n\n``enable_constant_model``\n~\n.. dropdown:: Constant Models\n\t:open:\n\n\tSpecify whether to enable :ref:`constant models `. This is set to Auto (enabled) by default. ``enable_decision_tree``\n\n.. dropdown:: Decision Tree Models\n\t:open:\n\n\tSpecify whether to build Decision Tree models as part of the experiment. This is set to Auto by default. In this case, Driverless AI will build Decision Tree models if interpretability is greater than or equal to the value of ``decision_tree_interpretability_switch`` (which defaults to 7) and accuracy is less than or equal to ``decision_tree_accuracy_switch`` (which defaults to 7)." }, { "output": " GLMs are very interpretable models with one coefficient per feature, an intercept term and a link function. This is set to Auto by default (enabled if accuracy <= 5 and interpretability >= 6). 
``enable_xgboost_gbm``\n\n.. dropdown:: XGBoost GBM Models\n\t:open:\n\n\tSpecify whether to build XGBoost models as part of the experiment (for both the feature engineering part and the final model). XGBoost is a type of gradient boosting method that has been widely successful in recent years due to its good regularization techniques and high accuracy." }, { "output": " In this case, Driverless AI will use XGBoost unless the number of rows * columns is greater than a threshold. This threshold is a config setting that is 100M by default for CPU and 30M by default for GPU. ``enable_lightgbm``\n~\n.. dropdown:: LightGBM Models\n\t:open:\n\n\tSpecify whether to build LightGBM models as part of the experiment. LightGBM models are the default models. This is set to Auto (enabled) by default. ``enable_xgboost_dart``\n~\n.. dropdown:: XGBoost Dart Models\n\t:open:\n\n\tSpecify whether to use XGBoost's Dart method when building models for the experiment (for both the feature engineering part and the final model)." }, { "output": " .. _enable_xgboost_rapids:\n\n``enable_xgboost_rapids``\n~\n.. dropdown:: Enable RAPIDS-cuDF extensions to XGBoost GBM/Dart\n\t:open:\n\n\tSpecify whether to enable RAPIDS extensions to XGBoost GBM/Dart. If selected, the Python scoring package can only be used on GPU systems. The equivalent config.toml parameter is ``enable_xgboost_rapids`` and the default value is False. This is disabled for Dask multinode models due to a bug in dask_cudf and XGBoost. .. _enable_xgboost_rf:\n\n``enable_xgboost_rf``\n~\n\n.. dropdown:: Enable XGBoost RF model\n\t:open:\n\n\tSpecify whether to enable XGBoost RF mode without early stopping." }, { "output": " .. _enable_xgboost_gbm_dask:\n\n``enable_xgboost_gbm_dask``\n~\n.. dropdown:: Enable Dask_cuDF (multi-GPU) XGBoost GBM\n\t:open:\n\n\tSpecify whether to enable the Dask_cudf (multi-GPU) version of XGBoost GBM. This is disabled unless switched on, and is only applicable for a single final model without early stopping. No Shapley values are possible. The equivalent config.toml parameter is ``enable_xgboost_gbm_dask`` and the default value is \"auto\". .. _enable_xgboost_dart_dask:\n\n``enable_xgboost_dart_dask``\n\n.. dropdown:: Enable Dask_cuDF (multi-GPU) XGBoost Dart\n\t:open:\n\n\tSpecify whether to enable the Dask_cudf (multi-GPU) version of XGBoost GBM/Dart." }, { "output": " This is only applicable for a single final model without early stopping, and no Shapley values are possible. The equivalent config.toml parameter is ``enable_xgboost_dart_dask`` and the default value is \"auto\". Running Dask_cudf on multiple GPUs is recommended; if, for example for debugging purposes, you would like to enable it on a single GPU, set ``use_dask_for_1_gpu`` to True in the config.toml file. .. _enable_lightgbm_dask:\n\n``enable_lightgbm_dask``\n\n.. dropdown:: Enable Dask (multi-node) LightGBM\n\t:open:\n\n\tSpecify whether to enable multi-node LightGBM." }, { "output": " The equivalent config.toml parameter is ``enable_lightgbm_dask`` and the default value is \"auto\". To enable multinode Dask, see :ref:`Dask Multinode Training `. .. _enable_hyperopt_dask:\n\n``enable_hyperopt_dask``\n\n.. dropdown:: Enable Dask (multi-node/multi-GPU) hyperparameter search\n\t:open:\n\n\tSpecify whether to enable the Dask (multi-node/multi-GPU) version of hyperparameter search. \"auto\" and \"on\" are currently the same. Dask mode for hyperparameter search is enabled if:\n\n\t\t1) You have a :ref:`Dask multinode cluster ` or a multi-GPU node, and each model uses 1 GPU (see :ref:`num-gpus-per-model`)." 
}, { "output": " The equivalent config.toml parameter is ``enable_hyperopt_dask`` and the default value is \"auto\". .. _num_inner_hyperopt_trials_prefinal:\n\n``num_inner_hyperopt_trials_prefinal``\n\n.. dropdown:: Number of trials for hyperparameter optimization during model tuning only\n\t:open:\n\n\tSpecify the number of trials for Optuna hyperparameter optimization for tuning and evolution of models. If using RAPIDS or DASK, this parameter specifies the number of trials for hyperparameter optimization within XGBoost GBM/Dart and LightGBM and hyperparameter optimization keeps data on GPU entire time." }, { "output": " For small data, 100 is fine, while for larger data smaller values are reasonable if need results quickly. If using RAPIDS or DASK, hyperparameter optimization stays on GPU the entire time. The equivalent config.toml parameter is ``num_inner_hyperopt_trials_prefinal`` and the default value is 0. Note that, this is useful when there is high overhead of DAI outside inner model fit/predict (i.e the various file, process, and other DAI management processes), so this tunes without that overhead. However, this can overfit on a single fold when doing tuning or evolution, and if using Cross Validation then, averaging the fold hyperparameters can lead to unexpected results." }, { "output": " If using RAPIDS or DASK, this is number of trials for rapids-cudf hyperparameter optimization within XGBoost GBM/Dart and LightGBM, and hyperparameter optimization keeps data on GPU entire time. 0 means no trials.For small data, 100 is ok choice, while for larger data smaller values are reasonable if need results quickly. This setting applies to final model only, even if num_inner_hyperopt_trials=0. The equivalent config.toml parameter is ``num_inner_hyperopt_trials_final`` and the default value is 0." }, { "output": " The default value is -1, means all. 0 is same as choosing no Optuna trials. Might be only beneficial to optimize hyperparameters of best individual (i.e. value of 1) in ensemble. The default value is -1, means all. The equivalent config.toml parameter is ``num_hyperopt_individuals_final``\n\n``optuna_pruner``\n~\n.. dropdown:: Optuna Pruners\n\t:open:\n\n\t`Optuna Pruner `__ algorithm to use for early stopping of unpromising trials (applicable to XGBoost and LightGBM that support Optuna callbacks)." }, { "output": " To disable choose None. The equivalent config.toml parameter is ``optuna_pruner``\n\n``optuna_sampler``\n\n.. dropdown:: Optuna Samplers\n\t:open:\n\n\t`Optuna Sampler `__ algorithm to use for narrowing down and optimizing the search space (applicable to XGBoost and LightGBM that support Optuna callbacks). The default is TPESampler. To disable choose None. The equivalent config.toml parameter is ``optuna_sampler``\n\n``enable_xgboost_hyperopt_callback``\n\n\n.. dropdown:: Enable Optuna XGBoost Pruning callback\n\t:open:\n\n\tSpecify whether to enable Optuna's XGBoost Pruning callback to abort unpromising runs." }, { "output": " This not is enabled when tuning learning rate. The equivalent config.toml parameter is ``enable_xgboost_hyperopt_callback``\n\n``enable_lightgbm_hyperopt_callback``\n~\n.. dropdown:: Enable Optuna LightGBM Pruning callback\n\t:open:\n\n\tSpecify whether to enable Optuna's LightGBM Pruning callback to abort unpromising runs. This is True by default. This not is enabled when tuning learning rate. The equivalent config.toml parameter is ``enable_lightgbm_hyperopt_callback``\n\n``enable_tensorflow``\n~\n.. 
dropdown:: TensorFlow Models\n\t:open:\n\n\tSpecify whether to build `TensorFlow `__ models as part of the experiment (usually only for text feature engineering and for the final model unless it's used exclusively)." }, { "output": " This is set to Auto by default (not used unless the number of classes is greater than 10). TensorFlow models are not yet supported by Java MOJOs (only Python scoring pipelines and C++ MOJOs are supported). .. _enable_grownet:\n\n``enable_grownet``\n\n.. dropdown:: PyTorch GrowNet Models\n\t:open:\n\n\tSpecify whether to enable PyTorch-based :ref:`GrowNet ` models. By default, this parameter is set to Auto, i.e., Driverless AI decides internally whether to use the algorithm for the experiment. Set it to *on* to force the experiment to build a GrowNet model." }, { "output": " Note that MOJOs are not yet supported (only Python scoring pipelines). FTRL supports binomial and multinomial classification for categorical targets, as well as regression for continuous targets. This is set to Auto (disabled) by default. ``enable_rulefit``\n\n.. dropdown:: RuleFit Models\n\t:open:\n\n\tSpecify whether to build `RuleFit `__ models as part of the experiment. Note that MOJOs are not yet supported (only Python scoring pipelines). Note that multiclass classification is not yet supported for RuleFit models." }, { "output": " This is set to Auto (disabled) by default. .. _zero-inflated:\n\n``enable_zero_inflated_models``\n~\n.. dropdown:: Zero-Inflated Models\n\t:open:\n\n\tSpecify whether to enable the automatic addition of :ref:`zero-inflated models ` for regression problems with zero-inflated target values that meet certain conditions:\n\n\t::\n\n\t y >= 0, y.std() > y.mean()\n\n\tThis is set to Auto by default. ``enable_lightgbm_boosting_types``\n\n\n.. dropdown:: LightGBM Boosting Types\n\t:open:\n\n\tSpecify which boosting types to enable for LightGBM." }, { "output": " ``enable_lightgbm_cat_support``\n\n\n.. dropdown:: LightGBM Categorical Support\n\t:open:\n\n\tSpecify whether to enable LightGBM categorical feature support. This is disabled by default. Notes:\n\n\t- Only supported for CPU.\n\t- A MOJO is not built when this is enabled. .. _lightgbm_cuda:\n\n``enable_lightgbm_cuda_support``\n\n.. dropdown:: LightGBM CUDA Support\n\t:open:\n\n\tSpecify whether to enable the LightGBM CUDA implementation instead of OpenCL. LightGBM CUDA is supported on Linux x86-64 environments. ``show_constant_model``\n~\n.. dropdown:: Whether to Show Constant Models in Iteration Panel\n\t:open:\n\n\tSpecify whether to show constant models in the iteration panel." }, { "output": " ``params_tensorflow``\n~\n.. dropdown:: Parameters for TensorFlow\n\t:open:\n\n\tSpecify specific parameters for TensorFlow to override Driverless AI parameters. The following is an example of how the parameters can be configured:\n\n\t::\n\n\t params_tensorflow = \"{'lr': 0.01, 'add_wide': False, 'add_attention': True, 'epochs': 30,\n\t 'layers': [100, 100], 'activation': 'selu', 'batch_size': 64, 'chunk_size': 1000, 'dropout': 0.3,\n\t 'strategy': 'one_shot', 'l1': 0.0, 'l2': 0.0, 'ort_loss': 0.5, 'ort_loss_tau': 0.01, 'normalize_type': 'streaming'}\"\n\n\tThe following is an example of how layers can be configured:\n\n\t::\n\n\t [500, 500, 500], [100, 100, 100], [100, 100], [50, 50]\n\n\tMore information about TensorFlow parameters can be found in the `Keras documentation `__." }, { "output": " .. _max-trees-iterations:\n\n``max_nestimators``\n~\n.. 
dropdown:: Max Number of Trees/Iterations\n\t:open:\n\n\tSpecify the upper limit on the number of trees (GBM) or iterations (GLM). This defaults to 3000. Depending on accuracy settings, a fraction of this limit will be used. ``n_estimators_list_no_early_stopping``\n~\n.. dropdown:: n_estimators List to Sample From for Model Mutations for Models That Do Not Use Early Stopping\n\t:open:\n\n\tFor LightGBM, the dart and normal random forest modes do not use early stopping." }, { "output": " ``min_learning_rate_final``\n~\n.. dropdown:: Minimum Learning Rate for Final Ensemble GBM Models\n\t:open:\n\n\tThis value defaults to 0.01. This is the lower limit on the learning rate for final ensemble GBM models. In some cases, the maximum number of trees/iterations is insufficient for the final learning rate, which can lead to no early stopping getting triggered and poor final model performance. In that case, you can try increasing the learning rate by raising this minimum, or you can try increasing the maximum number of trees/iterations." }, { "output": " This value defaults to 0.05. ``max_nestimators_feature_evolution_factor``\n\n.. dropdown:: Reduction Factor for Max Number of Trees/Iterations During Feature Evolution\n\t:open:\n\n\tSpecify the factor by which the value specified by the :ref:`max-trees-iterations` setting is reduced for tuning and feature evolution. This option defaults to 0.2. So by default, Driverless AI will produce no more than 0.2 * 3000 trees/iterations during feature evolution. .. _max_abs_score_delta_train_valid:\n\n``max_abs_score_delta_train_valid``\n~\n.. dropdown:: Max. absolute delta between training and validation scores for tree models\n\t:open:\n\n\tModify early stopping behavior for tree-based models (LightGBM, XGBoostGBM, CatBoost) such that the training score (on training data, not holdout) and the validation score differ no more than this absolute value (i.e., stop adding trees once abs(train_score - valid_score) > max_abs_score_delta_train_valid)." }, { "output": " Keep in mind that the meaning of this value depends on the chosen scorer and the dataset (i.e., 0.01 for LogLoss is different than 0.01 for MSE). This option is Experimental, and only for expert use to keep model complexity low. To disable, set to 0.0. By default this option is disabled. .. _max_rel_score_delta_train_valid:\n\n``max_rel_score_delta_train_valid``\n~\n.. dropdown:: Max. relative delta between training and validation scores for tree models\n\t:open:\n\n\tModify early stopping behavior for tree-based models (LightGBM, XGBoostGBM, CatBoost) such that training score (on training data, not holdout) and validation score differ no more than this relative value (i.e., stop adding trees once abs(train_score - valid_score) > max_rel_score_delta_train_valid * abs(train_score))." }, { "output": " This option is Experimental, and only for expert use to keep model complexity low. To disable, set to 0.0. By default this option is disabled. ``min_learning_rate``\n~\n.. dropdown:: Minimum Learning Rate for Feature Engineering GBM Models\n\t:open:\n\n\tSpecify the minimum learning rate for feature engineering GBM models. This value defaults to 0.05. ``max_learning_rate``\n~\n.. dropdown:: Max Learning Rate for Tree Models\n\t:open:\n\n\tSpecify the maximum learning rate for tree models during feature engineering." }, { "output": " This value defaults to 0.5. ``max_epochs``\n\n.. dropdown:: Max Number of Epochs for TensorFlow/FTRL\n\t:open:\n\n\tWhen building TensorFlow or FTRL models, specify the maximum number of epochs to train models with (it might stop earlier). This value defaults to 10. This option is ignored if TensorFlow models and/or FTRL models are disabled. ``max_max_depth``\n~\n.. dropdown:: Max Tree Depth\n\t:open:\n\n\tSpecify the maximum tree depth. The corresponding maximum value for ``max_leaves`` is double the specified value." }, { "output": " ``max_max_bin``\n~\n.. 
dropdown:: Max max_bin for Tree Features\n\t:open:\n\n\tSpecify the maximum ``max_bin`` for tree features. This value defaults to 256. ``rulefit_max_num_rules``\n~\n.. dropdown:: Max Number of Rules for RuleFit\n\t:open:\n\n\tSpecify the maximum number of rules to be used for RuleFit models. This defaults to -1, which specifies to use all rules. .. _ensemble_meta_learner:\n\n``ensemble_meta_learner``\n~\n.. dropdown:: Model to Combine Base Model Predictions\n\t:open:\n\n\tModel to combine base model predictions, for experiments that create a final pipeline\n\tconsisting of multiple base models:\n\n\t- blender: Creates a linear blend with non-negative weights that add to 1 (blending) - recommended\n\t- extra_trees: Creates a tree model to non-linearly combine the base models (stacking) - experimental; it is recommended to also enable :ref:`cross_validate_meta_learner`." }, { "output": " .. _fixed_ensemble_level:\n\n``fixed_ensemble_level``\n~\n.. dropdown:: Ensemble Level for Final Modeling Pipeline\n\t:open:\n\n\tSpecify one of the following ensemble levels:\n\n\t- -1 = Auto, based upon accuracy settings and the size of the data (Default)\n\t- 0 = No ensemble, only final single model on validated iteration/tree count. Note that holdout predicted probabilities will not be available. (For more information, refer to this :ref:`FAQ `.)\n\t- 1 = 1 model, multiple ensemble folds (cross-validation)\n\t- 2 = 2 models, multiple ensemble folds (cross-validation)\n\t- 3 = 3 models, multiple ensemble folds (cross-validation)\n\t- 4 = 4 models, multiple ensemble folds (cross-validation)\n\n\tThe equivalent config.toml parameter is ``fixed_ensemble_level``." }, { "output": " .. _cross_validate_meta_learner:\n\n``cross_validate_meta_learner``\n~\n.. dropdown:: Cross-Validate Meta-Learner\n\t:open:\n\n\tEspecially recommended for ensemble_meta_learner='extra_trees', to make unbiased training holdout predictions. No MOJO will be created if this setting is enabled. Not needed for ensemble_meta_learner='blender'. ``cross_validate_single_final_model``\n~\n.. dropdown:: Cross-Validate Single Final Model\n\t:open:\n\n\tDriverless AI normally produces a single final model for low accuracy settings (typically, less than 5). When the Cross-validate single final model option is enabled (default for regular experiments), Driverless AI will perform cross-validation to determine optimal parameters and early stopping before training the final single modeling pipeline on the entire training data." }, { "output": " This also creates holdout predictions for all non-time-series experiments with a single final model. Note that the setting for this option is ignored for time-series experiments or when a validation dataset is provided. ``parameter_tuning_num_models``\n~\n.. dropdown:: Number of Models During Tuning Phase\n\t:open:\n\n\tSpecify the number of models to tune during the pre-evolution phase. Specify a lower value to avoid excessive tuning, or specify a higher value to perform enhanced tuning. This option defaults to -1 (auto). .. _imbalance_sampling_method:\n\n``imbalance_sampling_method``\n~\n.. dropdown:: Sampling Method for Imbalanced Binary Classification Problems\n\t:open:\n\n\tSpecify the sampling method to use for imbalanced binary classification problems." }, { "output": " This is set to off by default. Choose from the following options:\n\n\t- auto: sample both classes as needed, depending on data\n\t- over_under_sampling: over-sample the minority class and under-sample the majority class, depending on data\n\t- under_sampling: under-sample the majority class to reach class balance\n\t- off: do not perform any sampling\n\n\tThis option is closely tied with the ImbalancedLightGBM and ImbalancedXGBoostGBM models, which can be enabled/disabled on the Recipes tab under :ref:`included_models`." }, { "output": " If the target fraction proves to be above the allowed imbalance threshold, then sampling will be triggered.\n\t- If this option is ENABLED and the ImbalancedLightGBM and/or ImbalancedXGBoostGBM models are DISABLED, then no special sampling technique will be performed. The setting here will be ignored. 
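As a minimal, illustrative config.toml sketch (the values shown are examples only), the sampling method above can be combined with the related thresholds described in the following entries:\n\n\t::\n\n\t # Sketch: enable special sampling for imbalanced binary classification\n\t imbalance_sampling_method = \"over_under_sampling\"\n\t # Trigger special sampling when the majority:minority ratio is at least 5\n\t imbalance_ratio_sampling_threshold = 5\n\t # Aim for a balanced target distribution after sampling\n\t imbalance_sampling_target_minority_fraction = 0.5\n\n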
``imbalance_sampling_threshold_min_rows_original``\n\n.. dropdown:: Threshold for Minimum Number of Rows in Original Training Data to Allow Imbalanced Sampling\n\t:open:\n\n\tSpecify a threshold for the minimum number of rows in the original training data that allow imbalanced sampling." }, { "output": " ``imbalance_ratio_sampling_threshold``\n\n.. dropdown:: Ratio of Majority to Minority Class for Imbalanced Binary Classification to Trigger Special Sampling Techniques (if Enabled)\n\t:open:\n\n\tFor imbalanced binary classification problems, specify the ratio of majority to minority class. Special imbalanced models with sampling techniques are enabled when the ratio is equal to or greater than the specified ratio. This value defaults to 5. ``heavy_imbalance_ratio_sampling_threshold``\n\n.. dropdown:: Ratio of Majority to Minority Class for Heavily Imbalanced Binary Classification to Only Enable Special Sampling Techniques (if Enabled)\n\t:open:\n\n\tFor heavily imbalanced binary classification, specify the ratio of the majority to minority class equal and above which to enable only special imbalanced models on the full original data without upfront sampling." }, { "output": " ``imbalance_sampling_number_of_bags``\n~\n.. dropdown:: Number of Bags for Sampling Methods for Imbalanced Binary Classification (if Enabled)\n\t:open:\n\n\tSpecify the number of bags for sampling methods for imbalanced binary classification. This value defaults to -1. ``imbalance_sampling_max_number_of_bags``\n~\n.. dropdown:: Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification\n\t:open:\n\n\tSpecify the limit on the number of bags for sampling methods for imbalanced binary classification." }, { "output": " ``imbalance_sampling_max_number_of_bags_feature_evolution``\n~\n.. dropdown:: Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification During Feature Evolution Phase\n\t:open:\n\n\tSpecify the limit on the number of bags for sampling methods for imbalanced binary classification. This value defaults to 3. Note that this setting only applies to shift, leakage, tuning, and feature evolution models. To limit final models, use the Hard Limit on Number of Bags for Sampling Methods for Imbalanced Binary Classification setting." }, { "output": " This setting controls the approximate number of bags and is only active when the \"Hard limit on number of bags for sampling methods for imbalanced binary classification during feature evolution phase\" option is set to -1. This value defaults to 1. ``imbalance_sampling_target_minority_fraction``\n~\n.. dropdown:: Target Fraction of Minority Class After Applying Under/Over-Sampling Techniques\n\t:open:\n\n\tSpecify the target fraction of a minority class after applying under/over-sampling techniques. A value of 0.5 means that models/algorithms will be given a balanced target class distribution." }, { "output": " This value defaults to -1. ``ftrl_max_interaction_terms_per_degree``\n~\n.. dropdown:: Max Number of Automatic FTRL Interactions Terms for 2nd, 3rd, 4th order interactions terms (Each)\n\t:open:\n\n\tSamples the number of automatic FTRL interactions terms to no more than this value (for each of 2nd, 3rd, 4th order terms). This value defaults to 10000\n\n``enable_bootstrap``\n\n.. dropdown:: Whether to Enable Bootstrap Sampling for Validation and Test Scores\n\t:open:\n\n\tSpecify whether to enable bootstrap sampling." }, { "output": " This is enabled by default. ``tensorflow_num_classes_switch``\n~\n.. 
dropdown:: For Classification Problems with This Many Classes, Default to TensorFlow\n\t:open:\n\n\tSpecify the number of classes above which to use TensorFlow when it is enabled. Other models that are set to Auto will not be used above this number. (Models set to On, however, are still used.) This value defaults to 10. .. _compute-intervals:\n\n``prediction_intervals``\n\n.. dropdown:: Compute Prediction Intervals\n\t:open:\n\n\tSpecify whether to compute empirical prediction intervals based on holdout predictions." }, { "output": " Install the Driverless AI AWS Community AMI\n-\n\nWatch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same. Environment\n~\n\n+----------+---------------+----------+-----------------+\n| Provider | Instance Type | Num GPUs | Suitable for    |\n+==========+===============+==========+=================+\n| AWS      | p2.xlarge     | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | p2.8xlarge    | 8        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | p2.16xlarge   | 16       | Serious use     |\n|          +---------------+----------+-----------------+\n|          | p3.2xlarge    | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | p3.8xlarge    | 4        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | p3.16xlarge   | 8        | Serious use     |\n|          +---------------+----------+-----------------+\n|          | g3.4xlarge    | 1        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | g3.8xlarge    | 2        | Experimentation |\n|          +---------------+----------+-----------------+\n|          | g3.16xlarge   | 4        | Serious use     |\n+----------+---------------+----------+-----------------+\n\n\nInstalling the EC2 Instance\n~\n\n1." }, { "output": " 2. In the upper right corner of the Amazon Web Services page, set the location drop-down. (Note: We recommend selecting the US East region because H2O's resources are stored there. It also offers more instance types than other regions.) .. image:: ../images/ami_location_dropdown.png\n :align: center\n\n\n3. Select the EC2 option under the Compute section to open the EC2 Dashboard. .. image:: ../images/ami_select_ec2.png\n :align: center\n\n4. Click the Launch Instance button under the Create Instance section." }, { "output": " Under Community AMIs, search for h2oai, and then select the version that you want to launch. .. image:: ../images/ami_select_h2oai_ami.png\n :align: center\n\n6. On the Choose an Instance Type page, select GPU compute in the Filter by dropdown. This will ensure that your Driverless AI instance will run on GPUs. Select a GPU compute instance from the available options. (We recommend at least 32 vCPUs.) Click the Next: Configure Instance Details button. .. image:: ../images/ami_choose_instance_type.png\n :align: center\n\n7." }, { "output": " Create a VPC or use an existing one, and ensure that \"Auto-Assign Public IP\" is enabled and associated with your subnet. Click Next: Add Storage. .. image:: ../images/ami_configure_instance_details.png\n :align: center\n\n8. Specify the Storage Device settings. Note again that Driverless AI requires 10 GB to run and will stop working if less than 10 GB is available. The machine should have a minimum of 30 GB of disk space. Click Next: Add Tags. .. image:: ../images/ami_add_storage.png\n :align: center\n\n9." }, { "output": " Click Next: Configure Security Group. 10. Add the following security rules to enable SSH access to Driverless AI, then click Review and Launch. \n\n    +-----------------+----------+------------+--------------------+-------------+\n    | Type            | Protocol | Port Range | Source             | Description |\n    +=================+==========+============+====================+=============+\n    | SSH             | TCP      | 22         | Anywhere 0.0.0.0/0 |             |\n    +-----------------+----------+------------+--------------------+-------------+\n    | Custom TCP Rule | TCP      | 12345      | Anywhere 0.0.0.0/0 | Launch DAI  |\n    +-----------------+----------+------------+--------------------+-------------+\n\n .. image:: ../images/ami_add_security_rules.png\n :align: center\n\n11. Review the configuration, and then click Launch." }, { "output": " A popup will appear prompting you to select a key pair. This is required in order to SSH into the instance. 
You can select your existing key pair or create a new one. Be sure to accept the acknowledgement, then click Launch Instances to start the new instance. .. image:: ../images/ami_select_key_pair.png\n :align: center\n\n13. Upon successful completion, a message will display informing you that your instance is launching. Click the View Instances button to see information about the instance including the IP address." }, { "output": " 14. Open a Terminal window and SSH into the IP address of the AWS instance. Replace the DNS name below with your instance DNS. .. code-block:: bash \n\n ssh -i \"mykeypair.pem\" ubuntu@ec2-34-230-6-230.compute-1.amazonaws.com \n\n Note: If you receive a \"Permissions 0644 for \u2018mykeypair.pem\u2019 are too open\" error, run the following command to give the user read permission and remove the other permissions. .. code-block:: bash\n\n chmod 400 mykeypair.pem\n\n15. If you selected a GPU-compute instance, then you must enable persistence and optimizations of the GPU." }, { "output": " Note also that these commands need to be run once every reboot. Refer to the following for more information: \n\n - http://docs.nvidia.com/deploy/driver-persistence/index.html\n - https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/optimize_gpu.html\n - https://www.migenius.com/articles/realityserver-on-aws\n\n .. code-block:: bash\n\n # g3:\n sudo nvidia-persistenced --persistence-mode\n sudo nvidia-smi -acp 0\n sudo nvidia-smi --auto-boost-permission=0\n sudo nvidia-smi --auto-boost-default=0\n sudo nvidia-smi -ac \"2505,1177\"\n\n # p2:\n sudo nvidia-persistenced --persistence-mode\n sudo nvidia-smi -acp 0\n sudo nvidia-smi --auto-boost-permission=0\n sudo nvidia-smi --auto-boost-default=0\n sudo nvidia-smi -ac \"2505,875\"\n\n # p3:\n sudo nvidia-persistenced --persistence-mode\n sudo nvidia-smi -acp 0\n sudo nvidia-smi -ac \"877,1530\"\n\n\n16." }, { "output": " For example:\n\n .. code-block:: bash\n\n scp -i /path/mykeypair.pem ubuntu@ec2-34-230-6-230.compute-1.amazonaws.com:/path/to/file/to/be/copied/example.csv /path/of/destination/on/local/machine\n\n where:\n \n * ``-i`` specifies the identity file to use\n * ``mykeypair`` is the name of the private keypair file\n * ``ubuntu`` is the user name on the instance\n * ``ec2-34-230-6-230.compute-1.amazonaws.com`` is the public DNS name of the instance\n * ``example.csv`` is the file to transfer\n\n17." }, { "output": " Sign in to Driverless AI with the username h2oai and use the AWS InstanceID as the password. You will be prompted to enter your Driverless AI license key when you log in for the first time. .. code-block:: bash\n\n http://Your-Driverless-AI-Host-Machine:12345\n\nStopping the EC2 Instance\n~\n\nThe EC2 instance will continue to run even when you close the aws.amazon.com portal. To stop the instance: \n\n1. On the EC2 Dashboard, click the Running Instances link under the Resources section. 2. Select the instance that you want to stop." }, { "output": " .. _nlp-settings:\n\nNLP Settings\n\n\n``enable_tensorflow_textcnn``\n~\n.. dropdown:: Enable Word-Based CNN TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Word-based CNN TensorFlow models as transformers for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs. ``enable_tensorflow_textbigru``\n~\n.. 
dropdown:: Enable Word-Based BiGRU TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Word-based BiGRU TensorFlow models as transformers for NLP." }, { "output": " This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs. ``enable_tensorflow_charcnn``\n~\n.. dropdown:: Enable Character-Based CNN TensorFlow Models for NLP\n\t:open:\n\n\tSpecify whether to use out-of-fold predictions from Character-level CNN TensorFlow models as transformers for NLP. This option is ignored if TensorFlow is disabled. We recommend that you disable this option on systems that do not use GPUs. ``enable_pytorch_nlp_model``\n\n.. dropdown:: Enable PyTorch Models for NLP\n\t:open:\n\n\tSpecify whether to enable pretrained PyTorch models and fine-tune them for NLP tasks." }, { "output": " You need to set this to On if you want to use the PyTorch models like BERT for modeling. Only the first text column will be used for modeling with these models. We recommend that you disable this option on systems that do not use GPUs. ``enable_pytorch_nlp_transformer``\n\n.. dropdown:: Enable pre-trained PyTorch Transformers for NLP\n\t:open:\n\n\tSpecify whether to enable pretrained PyTorch models for NLP tasks. This is set to Auto by default, and is enabled for text-dominated problems only. You need to set this to On if you want to use the PyTorch models like BERT for feature engineering (via fitting a linear model on top of pretrained embeddings)." }, { "output": " Notes:\n\n\t- This setting requires an Internet connection. ``pytorch_nlp_pretrained_models``\n~\n.. dropdown:: Select Which Pretrained PyTorch NLP Models to Use\n\t:open:\n\n\tSpecify one or more pretrained PyTorch NLP models to use. Select from the following:\n\n\t- bert-base-uncased (Default)\n\t- distilbert-base-uncased (Default)\n\t- xlnet-base-cased\n\t- xlm-mlm-enfr-1024\n\t- roberta-base\n\t- albert-base-v2\n\t- camembert-base\n\t- xlm-roberta-base\n\n\tNotes:\n\n\t- This setting requires an Internet connection.\n\t- Models that are not selected by default may not have MOJO support." }, { "output": " ``tensorflow_max_epochs_nlp``\n~\n.. dropdown:: Max TensorFlow Epochs for NLP\n\t:open:\n\n\tWhen building TensorFlow NLP features (for text data), specify the maximum number of epochs to train feature engineering models with (it might stop earlier). The higher the number of epochs, the higher the run time. This value defaults to 2 and is ignored if TensorFlow models are disabled. ``enable_tensorflow_nlp_accuracy_switch``\n\n.. dropdown:: Accuracy Above Enable TensorFlow NLP by Default for All Models\n\t:open:\n\n\tSpecify the accuracy threshold." }, { "output": " At lower accuracy, TensorFlow NLP transformations will only be created as a mutation. This value defaults to 5. ``pytorch_nlp_fine_tuning_num_epochs``\n\n.. dropdown:: Number of Epochs for Fine-Tuning of PyTorch NLP Models\n\t:open:\n\n\tSpecify the number of epochs used when fine-tuning PyTorch NLP models. This value defaults to 2. ``pytorch_nlp_fine_tuning_batch_size``\n\n.. dropdown:: Batch Size for PyTorch NLP Models\n\t:open:\n\n\tSpecify the batch size for PyTorch NLP models. This value defaults to 10." }, { "output": " ``pytorch_nlp_fine_tuning_padding_length``\n\n.. dropdown:: Maximum Sequence Length for PyTorch NLP Models\n\t:open:\n\n\tSpecify the maximum sequence length (padding length) for PyTorch NLP models. This value defaults to 100. Note: Large models and padding lengths require more memory. 
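As a brief, illustrative config.toml sketch, the PyTorch NLP fine-tuning settings above can be set together (the values shown are the defaults described in this section):\n\n\t::\n\n\t # Sketch: enable PyTorch NLP models and pin the fine-tuning settings\n\t enable_pytorch_nlp_model = \"on\"\n\t pytorch_nlp_fine_tuning_num_epochs = 2\n\t pytorch_nlp_fine_tuning_batch_size = 10\n\t pytorch_nlp_fine_tuning_padding_length = 100\n\n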
``pytorch_nlp_pretrained_models_dir``\n~\n.. dropdown:: Path to Pretrained PyTorch NLP Models\n\t:open:\n\n\tSpecify a path to pretrained PyTorch NLP models. To get all available models, download http://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/pretrained/bert_models.zip, then extract the folder and store it in a directory on the instance where Driverless AI is installed:\n\n\t::\n\n\t pytorch_nlp_pretrained_models_dir = /path/on/server/to/bert_models_folder\n\n.. _tensorflow_nlp_pretrained_embeddings_file_path:\n\n``tensorflow_nlp_pretrained_embeddings_file_path``\n\n.. dropdown:: Path to Pretrained Embeddings for TensorFlow NLP Models\n\t:open:\n\n\tSpecify a path to pretrained embeddings that will be used for the TensorFlow NLP models." }, { "output": " Notes:\n\n\t- If an S3 location is specified, an S3 access key ID and S3 secret access key can also be specified with the :ref:`tensorflow_nlp_pretrained_s3_access_key_id` and :ref:`tensorflow_nlp_pretrained_s3_secret_access_key` expert settings respectively.\n\t- You can download the Glove embeddings from `here `__ and specify the local path in this box.\n\t- You can download the fasttext embeddings from `here `__ and specify the local path in this box." }, { "output": " Please refer to `this code sample `__ for creating custom embeddings that can be passed on to this option.\n\t- If this field is left empty, embeddings will be trained from scratch. .. _tensorflow_nlp_pretrained_s3_access_key_id:\n\n``tensorflow_nlp_pretrained_s3_access_key_id``\n\n.. dropdown:: S3 access key ID to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location\n\t:open:\n\n\tSpecify an S3 access key ID to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location." }, { "output": " .. _tensorflow_nlp_pretrained_s3_secret_access_key:\n\n``tensorflow_nlp_pretrained_s3_secret_access_key``\n\n.. dropdown:: S3 secret access key to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location\n\t:open:\n\n\tSpecify an S3 secret access key to use when ``tensorflow_nlp_pretrained_embeddings_file_path`` is set to an S3 location. For more information, see :ref:`the entry on the tensorflow_nlp_pretrained_embeddings_file_path ` expert setting." }, { "output": " If this is disabled, the embedding layer will be frozen. All other weights, however, will still be fine-tuned. This is disabled by default. ``text_fraction_for_text_dominated_problem``\n\n.. dropdown:: Fraction of Text Columns Out of All Features to be Considered a Text-Dominated Problem\n\t:open:\n\n\tSpecify the fraction of text columns out of all features to be considered as a text-dominated problem. This value defaults to 0.3. Specify when a string column will be treated as text (for an NLP problem) or just as a standard categorical variable." }, { "output": " This value defaults to 0.3. ``text_transformer_fraction_for_text_dominated_problem``\n\n.. dropdown:: Fraction of Text per All Transformers to Trigger That Text Dominated\n\t:open:\n\n\tSpecify the fraction of text columns out of all features to be considered a text-dominated problem. This value defaults to 0.3. ``string_col_as_text_threshold``\n\n.. dropdown:: Threshold for String Columns to be Treated as Text\n\t:open:\n\n\tSpecify the threshold value (from 0 to 1) for string columns to be treated as text (0.0 - text; 1.0 - string)." }, { "output": " .. 
_quick-start-tables:\n\nQuick-Start Tables by Environment\n-\n\nUse the following tables for Cloud, Server, and Desktop to find the right setup instructions for your environment. Cloud\n~\n\nRefer to the following for more information about instance types:\n\n- `AWS Instance Types `__\n- `Azure Instance Types `__\n- `Google Compute Instance Types `__\n\n+------------------+----------------+----------+-----------------+----------------------------------+\n| Provider         | Instance Type  | Num GPUs | Suitable for    | Refer to Section                 |\n+==================+================+==========+=================+==================================+\n| NVIDIA GPU Cloud |                |          | Serious use     | :ref:`install-on-nvidia-dgx`     |\n+------------------+----------------+----------+-----------------+----------------------------------+\n| AWS              | p2.xlarge      | 1        | Experimentation | :ref:`install-on-aws`            |\n|                  +----------------+----------+-----------------+                                  |\n|                  | p2.8xlarge     | 8        | Serious use     |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | p2.16xlarge    | 16       | Serious use     |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | p3.2xlarge     | 1        | Experimentation |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | p3.8xlarge     | 4        | Serious use     |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | p3.16xlarge    | 8        | Serious use     |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | g3.4xlarge     | 1        | Experimentation |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | g3.8xlarge     | 2        | Experimentation |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | g3.16xlarge    | 4        | Serious use     |                                  |\n+------------------+----------------+----------+-----------------+----------------------------------+\n| Azure            | Standard_NV6   | 1        | Experimentation | :ref:`install-on-azure`          |\n|                  +----------------+----------+-----------------+                                  |\n|                  | Standard_NV12  | 2        | Experimentation |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | Standard_NV24  | 4        | Serious use     |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | Standard_NC6   | 1        | Experimentation |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | Standard_NC12  | 2        | Experimentation |                                  |\n|                  +----------------+----------+-----------------+                                  |\n|                  | Standard_NC24  | 4        | Serious use     |                                  |\n+------------------+----------------+----------+-----------------+----------------------------------+\n| Google Compute   |                |          |                 | :ref:`install-on-google-compute` |\n+------------------+----------------+----------+-----------------+----------------------------------+\n\nServer\n\n\n+-+-+-++\n| Operating System | GPUs?" }, { "output": " JDBC Setup\n\n\nDriverless AI lets you explore Java Database Connectivity (JDBC) data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with JDBC. Note: Depending on your Docker install version, use either the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using. Tested Databases\n\n\nThe following databases have been tested for minimal functionality." }, { "output": " We recommend that you test out your JDBC driver even if you do not see it on the list of tested databases. See the :ref:`untested-jdbc-driver` section at the end of this chapter for information on how to try out an untested JDBC driver. - Oracle DB\n- PostgreSQL\n- Amazon Redshift\n- Teradata\n\nDescription of Configuration Attributes\n~\n \n- ``jdbc_app_configs``: Configuration for the JDBC connector. This is a JSON/Dictionary String with multiple keys. Note: This requires a JSON key (typically the name of the database being configured) to be associated with a nested JSON that contains the ``url``, ``jarpath``, and ``classpath`` fields." }, { "output": " Double quotation marks (``\"...\"``) must be used to denote keys and values *within* the JSON dictionary, and *outer* quotations must be formatted as either ``\"\"\"``, ``'``, or ``'``. Depending on how the configuration value is applied, different forms of outer quotations may be required. The following examples show two unique methods for applying outer quotations. 
- Configuration value applied with the config.toml file:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}\"\"\"\n\n - Configuration value applied with an environment variable:\n \n ::\n \n DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"my_json_string\": \"value\", \"json_key_2\": \"value2\"}'\n \n For example:\n \n ::\n \n DRIVERLESS_AI_JDBC_APP_CONFIGS='{\n \"postgres\": {\"url\": \"jdbc:postgresql://192.xxx.x.xxx:aaaa/name_of_database;user=name_of_user;password=your_password\",\"jarpath\": \"/config/postgresql-xx.x.x.jar\",\"classpath\": \"org.postgresql.Driver\"}, \n \"postgres-local\": {\"url\": \"jdbc:postgresql://123.xxx.xxx.xxx:aaaa/name_of_database\",\"jarpath\": \"/config/postgresql-xx.x.x.jar\",\"classpath\": \"org.postgresql.Driver\"},\n \"ms-sql\": {\"url\": \"jdbc:sqlserver://192.xxx.x.xxx:aaaa;databaseName=name_of_database;user=name_of_user;password=your_password\",\"Username\":\"your_username\",\"password\":\"your_password\",\"jarpath\": \"/config/sqljdbc42.jar\",\"classpath\": \"com.microsoft.sqlserver.jdbc.SQLServerDriver\"},\n \"oracle\": {\"url\": \"jdbc:oracle:thin:@192.xxx.x.xxx:aaaa/orclpdb1\",\"jarpath\": \"ojdbc7.jar\",\"classpath\": \"oracle.jdbc.OracleDriver\"},\n \"db2\": {\"url\": \"jdbc:db2://127.x.x.x:aaaaa/name_of_database\",\"jarpath\": \"db2jcc4.jar\",\"classpath\": \"com.ibm.db2.jcc.DB2Driver\"},\n \"mysql\": {\"url\": \"jdbc:mysql://192.xxx.x.xxx:aaaa;\",\"jarpath\": \"mysql-connector.jar\",\"classpath\": \"com.mysql.jdbc.Driver\"},\n \"Snowflake\": {\"url\": \"jdbc:snowflake://.snowflakecomputing.com/?\",\"jarpath\": \"/config/snowflake-jdbc-x.x.x.jar\",\"classpath\": \"net.snowflake.client.jdbc.SnowflakeDriver\"},\n \"Derby\": {\"url\": \"jdbc:derby://127.x.x.x:aaaa/name_of_database\",\"jarpath\": \"/config/derbyclient.jar\",\"classpath\": \"org.apache.derby.jdbc.ClientDriver\"}\n }'\n\n- ``jdbc_app_jvm_args``: Extra JVM arguments for the JDBC connector." }, { "output": " - ``jdbc_app_classpath``: Optionally specify an alternative classpath for the JDBC connector. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Retrieve the JDBC Driver\n\n\n1. Download JDBC Driver JAR files:\n\n - `Oracle DB `_\n\n - `PostgreSQL `_\n\n - `Amazon Redshift `_\n\n - `Teradata `_\n\n Note: Remember to take note of the driver classpath, as it is needed for the configuration steps (for example, org.postgresql.Driver)." }, { "output": " 2. Copy the driver JAR to a location that can be mounted into the Docker container. Note: The folder storing the JDBC jar file must be visible/readable by the dai process user. Enable the JDBC Connector\n~\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the JDBC connector for PostgreSQL. Note that the JDBC connection strings will vary depending on the database that is used. .. 
code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,hdfs,jdbc\" \\\n -e DRIVERLESS_AI_JDBC_APP_CONFIGS='{\"postgres\": \n {\"url\": \"jdbc:postgres://localhost:5432/my_database\", \n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\", \n \"classpath\": \"org.postgresql.Driver\"}}' \\ \n -e DRIVERLESS_AI_JDBC_APP_JVM_ARGS=\"-Xmx2g\" \\\n -p 12345:12345 \\\n -v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure JDBC options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options:\n\n .. code-block:: bash \n\n enabled_file_systems = \"file, upload, jdbc\"\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgres://localhost:5432/my_database\",\n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n 2. Mount the config.toml file and requisite JAR files into the Docker container." }, { "output": " Notes: \n\n - The JDBC connection strings will vary depending on the database that is used. - The configuration requires a JSON key (typically the name of the database being configured) to be associated with a nested JSON that contains the ``url``, ``jarpath``, and ``classpath`` fields. In addition, this should take the format:\n\n ::\n\n \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\", \n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n\n 1." }, { "output": " For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Edit the following values in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." 
}, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"upload, file, hdfs, jdbc\"\n\n # Configuration for JDBC Connector. # JSON/Dictionary String with multiple keys. # Format as a single line without using carriage returns (the following example is formatted for readability). # Use triple quotations to ensure that the text is read as a single string. # Example:\n # \"\"\"{\n # \"postgres\": {\n # \"url\": \"jdbc:postgresql://ip address:port/postgres\",\n # \"jarpath\": \"/path/to/postgres_driver.jar\",\n # \"classpath\": \"org.postgresql.Driver\"\n # },\n # \"mysql\": {\n # \"url\":\"mysql connection string\",\n # \"jarpath\": \"/path/to/mysql_driver.jar\",\n # \"classpath\": \"my.sql.classpath.Driver\"\n # }\n # }\"\"\"\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgres://localhost:5432/my_database\",\n \"jarpath\": \"/path/to/postgresql/jdbc/driver.jar\",\n \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n # extra jvm args for jdbc connector\n jdbc_app_jvm_args = \"\"\n\n # alternative classpath for jdbc connector\n jdbc_app_classpath = \"\"\n\n 3." }, { "output": " Adding Datasets Using JDBC\n\n\nAfter the JDBC connector is enabled, you can add datasets by selecting JDBC from the Add Dataset (or Drag and Drop) drop-down menu. .. figure:: ../images/jdbc.png\n :alt: Make JDBC Query\n :scale: 30%\n\n1. Click on the Add Dataset button on the Datasets page. 2. Select JDBC from the list that appears. 3. Click on the Select JDBC Connection button to select a JDBC configuration. 4. The form will populate with the JDBC Database, URL, Driver, and Jar information. Complete the following remaining fields:\n\n - JDBC Username: Enter your JDBC username." }, { "output": " (See the *Notes* section)\n\n - Destination Name: Enter a name for the new dataset. - (Optional) ID Column Name: Enter a name for the ID column. Specify this field when making large data queries. Notes:\n\n - Do not include the password as part of the JDBC URL. Instead, enter the password in the JDBC Password field. The password is entered separately for security purposes. - Due to resource sharing within Driverless AI, the JDBC Connector is only allocated a relatively small amount of memory. - When making large queries, the ID column is used to partition the data into manageable portions." }, { "output": " - If a query that is larger than the maximum memory allocation is made without specifying an ID column, the query will not complete successfully. 5. Write a SQL Query in the format of the database that you want to query. (See the `Query Examples <#queryexamples>`__ section below.) The format will vary depending on the database that is used. 6. Click the Click to Make Query button to execute the query. The time it takes to complete depends on the size of the data being queried and the network speeds to the database." }, { "output": " .. _queryexamples:\n\nQuery Examples\n\n\nThe following are sample configurations and queries for Oracle DB and PostgreSQL:\n\n.. tabs:: \n .. group-tab:: Oracle DB\n\n 1. Configuration:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"oracledb\": {\"url\": \"jdbc:oracle:thin:@localhost:1521/oracledatabase\", \"jarpath\": \"/home/ubuntu/jdbc-jars/ojdbc8.jar\", \"classpath\": \"oracle.jdbc.OracleDriver\"}}\"\"\"\n\n 2. Sample Query:\n\n - Select oracledb from the Select JDBC Connection dropdown menu. 
- JDBC Username: ``oracleuser``\n - JDBC Password: ``oracleuserpassword``\n - ID Column Name:\n - Query:\n\n ::\n\n SELECT MIN(ID) AS NEW_ID, EDUCATION, COUNT(EDUCATION) FROM my_oracle_schema.creditcardtrain GROUP BY EDUCATION\n\n Note: Because this query does not specify an ID Column Name, it will only work for small data." }, { "output": " 3. Click the Click to Make Query button to execute the query. .. group-tab:: PostgreSQL \n\n 1. Configuration:\n\n ::\n\n jdbc_app_configs = \"\"\"{\"postgres\": {\"url\": \"jdbc:postgresql://localhost:5432/postgresdatabase\", \"jarpath\": \"/home/ubuntu/postgres-artifacts/postgres/Driver.jar\", \"classpath\": \"org.postgresql.Driver\"}}\"\"\"\n\n 2. Sample Query:\n\n - Select postgres from the Select JDBC Connection dropdown menu. - JDBC Username: ``postgres_user``\n - JDBC Password: ``pguserpassword``\n - ID Column Name: ``id``\n - Query:\n\n ::\n\n SELECT * FROM loan_level WHERE LOAN_TYPE = 5 (selects all columns from table loan_level with column LOAN_TYPE containing value 5)\n\n 3." }, { "output": " .. _untested-jdbc-driver:\n\nAdding an Untested JDBC Driver\n\n\nWe encourage you to try out JDBC drivers that are not tested in house. .. tabs:: \n .. group-tab:: Docker Image Installs\n\n 1. Download the JDBC jar for your database. 2. Move your JDBC jar file to a location that DAI can access. 3. Start the Driverless AI Docker image using the JDBC-specific environment variables. .. code-block:: bash\n :substitutions:\n\n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"upload,file,hdfs,s3,recipe_file,jdbc\" \\\n -e DRIVERLESS_AI_JDBC_APP_CONFIGS=\"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\\ \n -e DRIVERLESS_AI_JDBC_APP_JVM_ARGS=\"-Xmx2g\" \\\n -p 12345:12345 \\\n -v /path/to/local/postgresql/jdbc/driver.jar:/path/to/postgresql/jdbc/driver.jar \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n 1." }, { "output": " 2. Move your JDBC jar file to a location that DAI can access. 3. Configure the Driverless AI config.toml file. Set the following configuration options:\n\n .. code-block:: bash \n\n enabled_file_systems = \"upload, file, hdfs, s3, recipe_file, jdbc\"\n jdbc_app_configs = \"\"\"{\"my_jdbc_database\": {\"url\": \"jdbc:my_jdbc_database://hostname:port/database\",\n \"jarpath\": \"/path/to/my/jdbc/database.jar\", \n \"classpath\": \"com.my.jdbc.Driver\"}}\"\"\"\n #Optional arguments\n jdbc_app_jvm_args = \"\"\n jdbc_app_classpath = \"\"\n\n 4." }, { "output": " .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/jdbc/driver.jar:/path/in/docker/jdbc/driver.jar \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. 
group-tab:: Native Installs\n\n 1." }, { "output": " 2. Move your JDBC jar file to a location that DAI can access. 3. Modify the following config.toml settings. Note that these can also be specified as environment variables when starting Driverless AI in Docker:\n\n ::\n\n # enable the JDBC file system\n enabled_file_systems = \"upload, file, hdfs, s3, recipe_file, jdbc\"\n\n # Configure the JDBC Connector. # JSON/Dictionary String with multiple keys. # Format as a single line without using carriage returns (the following example is formatted for readability)." }, { "output": " MinIO Setup\n-\n\nThis section provides instructions for configuring Driverless AI to work with `MinIO `__. Note that unlike S3, authentication must also be configured when the MinIO data connector is specified. Note: Depending on your Docker install version, use either the ``docker run runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using." }, { "output": " - ``minio_access_key_id``: The MinIO access key. - ``minio_secret_access_key``: The MinIO secret access key. - ``minio_skip_cert_verification``: If this is set to true, then MinIO connector will skip certificate verification. This is set to false by default. - ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly. Enable MinIO with Authentication\n\n\n.. tabs::\n .. group-tab:: Docker Image Installs\n\n This example enables the MinIO data connector with authentication by passing an endpoint URL, access key ID, and an access key." }, { "output": " This lets you reference data stored in MinIO directly using the endpoint URL, for example: http:////datasets/iris.csv. .. code-block:: bash\n :substitutions:\n\n \t nvidia-docker run \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS=\"file,minio\" \\\n -e DRIVERLESS_AI_MINIO_ENDPOINT_URL=\"\"\n -e DRIVERLESS_AI_MINIO_ACCESS_KEY_ID=\"\" \\\n -e DRIVERLESS_AI_MINIO_SECRET_ACCESS_KEY=\"\" \\ \n -e DRIVERLESS_AI_MINIO_SKIP_CERT_VERIFICATION=\"false\" \\\n -p 12345:12345 \\\n init -it rm \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n .. group-tab:: Docker Image with the config.toml\n\n This example shows how to configure MinIO options in the config.toml file, and then specify that file when starting Driverless AI in Docker." }, { "output": " Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, minio\"``\n - ``minio_endpoint_url = \"\"``\n - ``minio_access_key_id = \"\"``\n - ``minio_secret_access_key = \"\"``\n - ``minio_skip_cert_verification = \"false\"``\n\n 2. Mount the config.toml file into the Docker container. .. code-block:: bash\n :substitutions:\n \n nvidia-docker run \\\n pid=host \\\n init \\\n rm \\\n shm-size=256m \\\n add-host name.node:172.16.2.186 \\\n -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \\\n -p 12345:12345 \\\n -v /local/path/to/config.toml:/path/in/docker/config.toml \\\n -v /etc/passwd:/etc/passwd:ro \\\n -v /etc/group:/etc/group:ro \\\n -v /tmp/dtmp/:/tmp \\\n -v /tmp/dlog/:/log \\\n -v /tmp/dlicense/:/license \\\n -v /tmp/ddata/:/data \\\n -u $(id -u):$(id -g) \\\n h2oai/dai-ubi8-x86_64:|tag|\n\n\n .. 
group-tab:: Native Installs\n\n This example enables the MinIO data connector with authentication by passing an endpoint URL, access key ID, and an access key." }, { "output": " This allows users to reference data stored in MinIO directly using the endpoint URL, for example: http:////datasets/iris.csv. 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file." }, { "output": " (jdbc_app_configs)\n # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, minio\"\n\n # MinIO Connector credentials\n minio_endpoint_url = \"\"\n minio_access_key_id = \"\"\n minio_secret_access_key = \"\"\n minio_skip_cert_verification = \"false\"\n\n 3." }, { "output": " .. _install-on-azure:\n\nInstall on Azure\n\n\nThis section describes how to install the Driverless AI image from Azure. Note: Prior versions of the Driverless AI installation and upgrade on Azure were done via Docker. This is no longer the case as of version 1.5.2. Watch the installation video `here `__. Note that some of the images in this video may change between releases, but the installation steps remain the same." }, { "output": " Log in to your Azure portal at https://portal.azure.com, and click the Create a Resource button. 2. Search for and select H2O DriverlessAI in the Marketplace. .. image:: ../images/azure_select_driverless_ai.png\n :align: center\n\n3. Click Create. This launches the H2O DriverlessAI Virtual Machine creation process. .. image:: ../images/azure_search_for_dai.png\n :align: center\n\n4. On the Basics tab:\n\n a. Enter a name for the VM. b. Select the Disk Type for the VM. Use HDD for GPU instances. c. Enter the name that you will use when connecting to the machine through SSH." }, { "output": " e. Specify the Subscription option. (This should be Pay-As-You-Go.) f. Enter a unique name for the resource group. g. Specify the VM region. Click OK when you are done. .. image:: ../images/azure_basics_tab.png\n :align: center\n\n5. On the Size tab, select your virtual machine size. Specify the HDD disk type and select a configuration. We recommend using an N-Series type, which comes with a GPU. Also note that Driverless AI requires 10 GB of free space in order to run and will stop working if less than 10 GB is available." }, { "output": " Click OK when you are done. .. image:: ../images/azure_vm_size.png\n :align: center\n\n6. On the Settings tab, select or create the Virtual Network and Subnet where the VM is going to be located and then click OK.\n\n .. image:: ../images/azure_settings_tab.png\n :align: center\n\n7. The Summary tab performs a validation on the specified settings and will report back any errors. When the validation passes successfully, click Create to create the VM. .. image:: ../images/azure_summary_tab.png\n :align: center\n\n8." }, { "output": " Select this Driverless AI VM to view the IP address of your newly created machine. 9. Connect to Driverless AI with your browser using the IP address retrieved in the previous step. .. 
   .. code-block:: bash

      http://Your-Driverless-AI-Host-Machine:12345

Stopping the Azure Instance
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Azure instance will continue to run even when you close the Azure portal. To stop the instance:

1. Click the Virtual Machines left menu item.
2. Select the checkbox beside your DriverlessAI virtual machine.
3. Click Stop.

Upgrading the Driverless AI Community Image
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. include:: upgrade-warning.frag

Upgrading from Version 1.2.2 or Earlier
'''''''''''''''''''''''''''''''''''''''

The following example shows how to upgrade from 1.2.2 or earlier to the current version. Upgrading from these earlier versions requires an edit to the ``start`` and ``h2oai`` scripts.

1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:

   .. code-block:: bash

      # Set up a directory of the previous version name
      mkdir dai_rel_1.2.2

      # Copy the data, log, license, and tmp directories as backup
      cp -a ./data dai_rel_1.2.2/data
      cp -a ./log dai_rel_1.2.2/log
      cp -a ./license dai_rel_1.2.2/license
      cp -a ./tmp dai_rel_1.2.2/tmp

2. Download the Driverless AI image. The command below retrieves version 1.2.2:

   .. code-block:: bash

      wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.2.2-6/x86_64-centos7/dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz

3. In the /home/ubuntu/scripts/ folder, edit both the ``start.sh`` and ``h2oai.sh`` scripts to use the newer image.

4. Use the ``docker load`` command to load the image:

   .. code-block:: bash

      docker load < dai-docker-centos7-x86_64-1.2.2-9.0.tar.gz

5. Optionally run ``docker images`` to ensure that the new image is in the registry.

6. Connect to Driverless AI with your browser at http://Your-Driverless-AI-Host-Machine:12345.

Upgrading from Version 1.3.0 or Later
'''''''''''''''''''''''''''''''''''''

The following example shows how to upgrade from version 1.3.0.

1. SSH into the IP address of the image instance and copy the existing experiments to a backup location:

   .. code-block:: bash

      # Set up a directory of the previous version name
      mkdir dai_rel_1.3.0

      # Copy the data, log, license, and tmp directories as backup
      cp -a ./data dai_rel_1.3.0/data
      cp -a ./log dai_rel_1.3.0/log
      cp -a ./license dai_rel_1.3.0/license
      cp -a ./tmp dai_rel_1.3.0/tmp

2.

.. _gbq:

Google BigQuery Setup
#####################

Driverless AI lets you explore Google BigQuery (GBQ) data sources from within the Driverless AI application. This page provides instructions for configuring Driverless AI to work with GBQ.

.. note::
   The setup described on this page requires you to enable authentication. Enabling the GCS and/or GBQ connectors causes those file systems to be displayed in the UI, but the GCS and GBQ connectors cannot be used without first enabling authentication.

To enable the GBQ data connector with authentication:

1. In the Google Cloud Platform (GCP), create a private key for your service account. To create a private key, click Service Accounts > Keys, and then click the Add Key button. When the Create private key dialog appears, select JSON as the key type. To finish creating the JSON private key and download it to your local file system, click Create.

2. Mount the downloaded JSON file to the Docker instance.

3. Specify the path to the downloaded and mounted ``auth-key.json`` file with the ``gcs_path_to_service_account_json`` config option.

Note: Depending on your Docker install version, use either the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using.
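As an alternative to the console flow in step 1 above, the private key can also be created from the command line with the gcloud CLI. The following is a minimal sketch; the service account email is a placeholder that you would replace with your own:

.. code-block:: bash

   # Create and download a JSON key for an existing service account
   # (the --iam-account value below is illustrative, not a real account)
   gcloud iam service-accounts keys create ./service_account_json.json \
       --iam-account=my-dai-sa@my-project.iam.gserviceaccount.com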
The following sections describe how to enable the GBQ data connector:

- :ref:`gbq-config-toml`
- :ref:`gbq-environment-variable`
- :ref:`gbq-workload-identity`

.. _gbq-config-toml:

Enabling GBQ with the config.toml file
**************************************

.. tabs::
   .. group-tab:: Docker Image Installs

      This example enables the GBQ data connector with authentication by passing the JSON authentication file. This assumes that the JSON file contains Google BigQuery authentications.

      1. Configure the Driverless AI config.toml file. Set the following configuration options:

         - ``enabled_file_systems = "file, upload, gbq"``
         - ``gcs_path_to_service_account_json = "/service_account_json.json"``

      2. Mount the config.toml file into the Docker container.

      .. code-block:: bash
         :substitutions:

          nvidia-docker run \
              --pid=host \
              --rm \
              --shm-size=256m \
              --add-host name.node:172.16.2.186 \
              -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
              -p 12345:12345 \
              -v /local/path/to/config.toml:/path/in/docker/config.toml \
              -v /etc/passwd:/etc/passwd:ro \
              -v /etc/group:/etc/group:ro \
              -v /tmp/dtmp/:/tmp \
              -v /tmp/dlog/:/log \
              -v /tmp/dlicense/:/license \
              -v /tmp/ddata/:/data \
              -u $(id -u):$(id -g) \
              h2oai/dai-ubi8-x86_64:|tag|

   .. group-tab:: Native Installs

      This example enables the GBQ data connector with authentication by passing the JSON authentication file.

      1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:

         ::

            # DEB and RPM
            export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

            # TAR SH
            export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

      2. Specify the following configuration options in the config.toml file.

         ::

            # File System Support
            # file : local file system/server file system
            # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
            enabled_file_systems = "file, gbq"

            # GCS Connector credentials
            # example (suggested) "/licenses/my_service_account_json.json"
            gcs_path_to_service_account_json = "/service_account_json.json"

      3. Save the changes when you are done, then stop/restart Driverless AI.

.. _gbq-environment-variable:

Enabling GBQ by setting an environment variable
***********************************************

The GBQ data connector can be configured by setting the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable as follows:

::

   export GOOGLE_APPLICATION_CREDENTIALS="SERVICE_ACCOUNT_KEY_PATH"

In the preceding example, replace ``SERVICE_ACCOUNT_KEY_PATH`` with the path of the JSON file that contains your service account key. The following is an example of how this might look:

::

   export GOOGLE_APPLICATION_CREDENTIALS="/etc/dai/service-account.json"

To see how to set this environment variable with Docker, refer to the following example:

.. code-block:: bash
   :substitutions:

    nvidia-docker run \
        --pid=host \
        --rm \
        --shm-size=256m \
        -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,gbq" \
        -e GOOGLE_APPLICATION_CREDENTIALS="/service_account_json.json" \
        -u `id -u`:`id -g` \
        -p 12345:12345 \
        -v `pwd`/data:/data \
        -v `pwd`/log:/log \
        -v `pwd`/license:/license \
        -v `pwd`/tmp:/tmp \
        -v `pwd`/service_account_json.json:/service_account_json.json \
        h2oai/dai-ubi8-x86_64:|tag|

For more information on setting the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable, refer to the `official documentation on setting the environment variable `_.
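For native installs, the variable has to be visible to the process that starts Driverless AI. The following is a minimal sketch of persisting it and sanity-checking the key file; the path is the same illustrative path used above:

.. code-block:: bash

   # Persist the variable so it is set for the user that starts Driverless AI
   echo 'export GOOGLE_APPLICATION_CREDENTIALS="/etc/dai/service-account.json"' >> ~/.bashrc
   source ~/.bashrc

   # The key file should exist and be readable by the Driverless AI user
   test -r "$GOOGLE_APPLICATION_CREDENTIALS" \
       && echo "service account key is readable" \
       || echo "service account key is missing or unreadable"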
}, { "output": " For information on how to enable Workload Identity, refer to the `official documentation on enabling Workload Identity on a GKE cluster `_. .. note::\n\tIf Workload Identity is enabled, then the ``GOOGLE_APPLICATION_CREDENTIALS`` environment variable does not need to be set. Adding Datasets Using GBQ\n*\n\nAfter Google BigQuery is enabled, you can add datasets by selecting Google Big Query from the Add Dataset (or Drag and Drop) drop-down menu." }, { "output": " .. figure:: ../images/add_dataset_dropdown.png\n :alt: Add Dataset\n :scale: 40\n\nSpecify the following information to add your dataset:\n\n1. Enter BQ Dataset ID with write access to create temporary table: Enter a dataset ID in Google BigQuery that this user has read/write access to. BigQuery uses this dataset as the location for the new table generated by the query. Note: Driverless AI's connection to GBQ will inherit the top-level directory from the service JSON file. So if a dataset named \"my-dataset\" is in a top-level directory named \"dai-gbq\", then the value for the dataset ID input field would be \"my-dataset\" and not \"dai-gbq:my-dataset\"." }, { "output": " Enter Google Storage destination bucket: Specify the name of Google Cloud Storage destination bucket. Note that the user must have write access to this bucket. 3. Enter Name for Dataset to be saved as: Specify a name for the dataset, for example, ``my_file``. 4. Enter BigQuery Query (Use StandardSQL): Enter a StandardSQL query that you want BigQuery to execute. For example: ``SELECT * FROM .``. 5. (Optional) Specify a project to use with the GBQ connector. This is equivalent to providing ``project`` when using a command-line interface." }, { "output": " Linux Docker Images\n-\n\nTo simplify local installation, Driverless AI is provided as a Docker image for the following system combinations:\n\n+-++-+-+\n| Host OS | Docker Version | Host Architecture | Min Mem |\n+=++=+=+\n| Ubuntu 16.04 or later | Docker CE | x86_64 | 64 GB |\n+-++-+-+\n| RHEL or CentOS 7.4 or later | Docker CE | x86_64 | 64 GB |\n+-++-+-+\n| NVIDIA DGX Registry | | x86_64 | |\n+-++-+-+\n\nNote: CUDA 11.2.2 or later with NVIDIA drivers >= |NVIDIA-driver-ver| is recommended (GPU only)." }, { "output": " For the best performance, including GPU support, use nvidia-docker. For a lower-performance experience without GPUs, use regular docker (with the same docker image). These installation steps assume that you have a license key for Driverless AI. For information on how to obtain a license key for Driverless AI, visit https://h2o.ai/o/try-driverless-ai/. Once obtained, you will be prompted to paste the license key into the Driverless AI UI when you first log in, or you can save it as a .sig file and place it in the \\license folder that you will create during the installation process." }, { "output": " \nThis section provides instructions for upgrading Driverless AI versions that were installed in a Docker container. These steps ensure that existing experiments are saved. WARNING: Experiments, MLIs, and MOJOs reside in the Driverless AI tmp directory and are not automatically upgraded when Driverless AI is upgraded. - Build MLI models before upgrading. - Build MOJO pipelines before upgrading. - Stop Driverless AI and make a backup of your Driverless AI tmp directory before upgrading. If you did not build MLI on a model before upgrading Driverless AI, then you will not be able to view MLI on that model after upgrading." 
}, { "output": " If that MLI job appears in the list of Interpreted Models in your current version, then it will be retained after upgrading. If you did not build a MOJO pipeline on a model before upgrading Driverless AI, then you will not be able to build a MOJO pipeline on that model after upgrading. Before upgrading, be sure to build MOJO pipelines on all desired models and then back up your Driverless AI tmp directory. Note: Stop Driverless AI if it is still running. Requirements\n\n\nWe recommend to have NVIDIA driver >= |NVIDIA-driver-ver| installed (GPU only) in your host environment for a seamless experience on all architectures, including Ampere." }, { "output": " Go to `NVIDIA download driver `__ to get the latest NVIDIA Tesla A/T/V/P/K series drivers. For reference on CUDA Toolkit and Minimum Required Driver Versions and CUDA Toolkit and Corresponding Driver Versions, see `here `__ . .. note::\n\tIf you are using K80 GPUs, the minimum required NVIDIA driver version is 450.80.02. Upgrade Steps\n'\n\n1. SSH into the IP address of the machine that is running Driverless AI." }, { "output": " Set up a directory for the version of Driverless AI on the host machine:\n\n .. code-block:: bash\n :substitutions:\n\n # Set up directory with the version name\n mkdir |VERSION-dir|\n\n # cd into the new directory\n cd |VERSION-dir|\n\n3. Retrieve the Driverless AI package from https://www.h2o.ai/download/ and add it to the new directory. 4. Load the Driverless AI Docker image inside the new directory:\n\n .. code-block:: bash\n :substitutions:\n\n # Load the Driverless AI docker image\n docker load < dai-docker-ubi8-x86_64-|VERSION-long|.tar.gz\n\n5." }, { "output": " Install the Driverless AI AWS Marketplace AMI\n-\n\nA Driverless AI AMI is available in the AWS Marketplace beginning with Driverless AI version 1.5.2. This section describes how to install and run Driverless AI through the AWS Marketplace. Environment\n~\n\n++-++-+\n| Provider | Instance Type | Num GPUs | Suitable for |\n++=++=+\n| AWS | p2.xlarge | 1 | Experimentation |\n| +-++-+\n| | p2.8xlarge | 8 | Serious use |\n| +-++-+\n| | p2.16xlarge | 16 | Serious use |\n| +-++-+\n| | p3.2xlarge | 1 | Experimentation |\n| +-++-+\n| | p3.8xlarge | 4 | Serious use |\n| +-++-+\n| | p3.16xlarge | 8 | Serious use |\n| +-++-+\n| | g3.4xlarge | 1 | Experimentation |\n| +-++-+\n| | g3.8xlarge | 2 | Experimentation |\n| +-++-+\n| | g3.16xlarge | 4 | Serious use |\n++-++-+\n\nInstallation Procedure\n\n\n1." }, { "output": " 2. Search for Driverless AI. .. figure:: ../images/aws-marketplace-search.png\n :alt: Search for Driverless AI\n\n3. Select the version of Driverless AI that you want to install. .. figure:: ../images/aws-marketplace-versions.png\n :alt: Select version\n\n4. Scroll down to review/edit your region and the selected infrastructure and pricing. .. figure:: ../images/aws-marketplace-pricing-info.png\n :alt: Review pricing \n\n5. Return to the top and select Continue to Subscribe. .. figure:: ../images/aws-marketplace-continue-to-subscribe.png\n :alt: Continue to subscribe\n\n6. Review the subscription, then click Continue to Configure." }, { "output": " If desired, change the Fullfillment Option, Software Version, and Region. Note that this page also includes the AMI ID for the selected software version. Click Continue to Launch when you are done. .. figure:: ../images/aws-marketplace-configure-software.png\n :alt: Configure the software\n\n8. Review the configuration and choose a method for launching Driverless AI. 
   Click the Usage Instructions button in AWS to review your Driverless AI username and password. Scroll down to the bottom of the page and click Launch when you are done.

   .. figure:: ../images/aws-marketplace-success.png
      :alt: Success message

Starting Driverless AI
~~~~~~~~~~~~~~~~~~~~~~

This section describes how to start Driverless AI after the Marketplace AMI has been successfully launched.

1. Navigate to the `EC2 Console `__.

2. Select your instance.

3. Open another browser and launch Driverless AI by navigating to https://<public-ip>:12345.

4. Sign in to Driverless AI with the username h2oai and use the AWS InstanceID as the password.

Stopping the EC2 Instance
~~~~~~~~~~~~~~~~~~~~~~~~~

The EC2 instance will continue to run even when you close the aws.amazon.com portal. To stop the instance:

1. On the EC2 Dashboard, click the Running Instances link under the Resources section.
2. Select the instance that you want to stop.
3. In the Actions drop down menu, select Instance State > Stop.
4. A confirmation page will display. Click Yes, Stop to stop the instance.

Upgrading the Driverless AI Marketplace Image
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note that the first offering of the Driverless AI Marketplace image was 1.5.2. Perform the following steps if you are upgrading to a Driverless AI Marketplace image version greater than 1.5.2. Replace ``dai_NEWVERSION.deb`` below with the new Driverless AI version (for example, ``dai_1.5.4_amd64.deb``). Note that this upgrade process inherits the service user and group from /etc/dai/User.conf and /etc/dai/Group.conf. You do not need to manually specify the DAI_USER or DAI_GROUP environment variables during an upgrade.

.. code-block:: bash

   # Stop Driverless AI.
   sudo systemctl stop dai

   # Make a backup of /opt/h2oai/dai/tmp directory at this time.

kdb+ Setup
----------

Driverless AI lets you explore `kdb+ <https://kx.com/>`__ data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with kdb+.

Note: Depending on your Docker install version, use either the ``docker run --runtime=nvidia`` (>= Docker 19.03) or ``nvidia-docker`` (< Docker 19.03) command when starting the Driverless AI Docker image. Use ``docker version`` to check which version of Docker you are using.

Description of Configuration Attributes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- ``kdb_user``: (Optional) User name
- ``kdb_password``: (Optional) User's password
- ``kdb_hostname``: IP address or host of the kdb+ server
- ``kdb_port``: Port on which the kdb+ server is listening
- ``kdb_app_jvm_args``: (Optional) JVM args for kdb+ distributions (for example, ``-Dlog4j.configuration``).
- ``kdb_app_classpath``: (Optional) The kdb+ classpath (or other if the jar file is stored elsewhere).
- ``enabled_file_systems``: The file systems you want to enable. This must be configured in order for data connectors to function properly.

Example 1: Enable kdb+ with No Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. tabs::
   .. group-tab:: Docker Image Installs

      This example enables the kdb+ connector without authentication. The only required flags are the hostname and the port.
      .. code-block:: bash
         :substitutions:

          nvidia-docker run \
              --pid=host \
              --init \
              --rm \
              --shm-size=256m \
              --add-host name.node:172.16.2.186 \
              -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,kdb" \
              -e DRIVERLESS_AI_KDB_HOSTNAME="<ip_or_host_of_kdb_server>" \
              -e DRIVERLESS_AI_KDB_PORT="<kdb_server_port>" \
              -p 12345:12345 \
              -v /tmp/dtmp/:/tmp \
              -v /tmp/dlog/:/log \
              -v /tmp/dlicense/:/license \
              -v /tmp/ddata/:/data \
              -u $(id -u):$(id -g) \
              h2oai/dai-ubi8-x86_64:|tag|

   .. group-tab:: Docker Image with the config.toml

      This example shows how to configure kdb+ options in the config.toml file, and then specify that file when starting Driverless AI in Docker.

      1. Configure the Driverless AI config.toml file. Set the following configuration options:

         - ``enabled_file_systems = "file, upload, kdb"``
         - ``kdb_hostname = "<ip_or_host_of_kdb_server>"``
         - ``kdb_port = "<kdb_server_port>"``

      2. Mount the config.toml file into the Docker container.

      .. code-block:: bash
         :substitutions:

          nvidia-docker run \
              --pid=host \
              --init \
              --rm \
              --shm-size=256m \
              --add-host name.node:172.16.2.186 \
              -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
              -p 12345:12345 \
              -v /local/path/to/config.toml:/path/in/docker/config.toml \
              -v /etc/passwd:/etc/passwd:ro \
              -v /etc/group:/etc/group:ro \
              -v /tmp/dtmp/:/tmp \
              -v /tmp/dlog/:/log \
              -v /tmp/dlicense/:/license \
              -v /tmp/ddata/:/data \
              -u $(id -u):$(id -g) \
              h2oai/dai-ubi8-x86_64:|tag|

   .. group-tab:: Native Installs

      This example enables the kdb+ connector without authentication.

      1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:

         ::

            # DEB and RPM
            export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

            # TAR SH
            export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

      2. Specify the following configuration options in the config.toml file.

         ::

            # File System Support
            # upload : standard upload feature
            # file : local file system/server file system
            # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
            # dtap : Blue Data Tap file system, remember to configure the DTap section below
            # s3 : Amazon S3, optionally configure secret and access key below
            # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
            # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
            # minio : Minio Cloud Storage, remember to configure secret and access key below
            # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
            # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
            # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
            # jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
            # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
            # recipe_url: load custom recipe from URL
            # recipe_file: load custom recipe from local file system
            enabled_file_systems = "file, kdb"

            # KDB Connector credentials
            kdb_hostname = "<ip_or_host_of_kdb_server>"
            kdb_port = "<kdb_server_port>"

      3. Save the changes when you are done, then stop/restart Driverless AI.

Example 2: Enable kdb+ with Authentication
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. tabs::
   .. group-tab:: Docker Image Installs

      This example provides users credentials for accessing a kdb+ server from Driverless AI.
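      The full ``nvidia-docker run`` command for this example is not shown here. Following the pattern of Example 1, it would add the credential variables. Note that the ``DRIVERLESS_AI_KDB_USER`` and ``DRIVERLESS_AI_KDB_PASSWORD`` names below are inferred from the ``DRIVERLESS_AI_<OPTION>`` naming convention used throughout this document rather than taken from this section, so treat this as a sketch:

      .. code-block:: bash
         :substitutions:

          nvidia-docker run \
              --pid=host \
              --init \
              --rm \
              --shm-size=256m \
              -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,kdb" \
              -e DRIVERLESS_AI_KDB_HOSTNAME="<ip_or_host_of_kdb_server>" \
              -e DRIVERLESS_AI_KDB_PORT="<kdb_server_port>" \
              -e DRIVERLESS_AI_KDB_USER="<username>" \
              -e DRIVERLESS_AI_KDB_PASSWORD="<password>" \
              -p 12345:12345 \
              -u $(id -u):$(id -g) \
              h2oai/dai-ubi8-x86_64:|tag|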
}, { "output": " Note that this example enables kdb+ with no authentication. 1. Configure the Driverless AI config.toml file. Set the following configuration options. - ``enabled_file_systems = \"file, upload, kdb\"``\n - ``kdb_user = \"\"``\n - ``kdb_password = \"\"``\n - ``kdb_hostname = \"``\n - ``kdb_port = \"\"``\n - ``kdb_app_classpath = \"\"``\n - ``kdb_app_jvm_args = \"\"``\n\n 2. Mount the config.toml file into the Docker container." }, { "output": " 1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:\n\n ::\n\n # DEB and RPM\n export DRIVERLESS_AI_CONFIG_FILE=\"/etc/dai/config.toml\"\n\n # TAR SH\n export DRIVERLESS_AI_CONFIG_FILE=\"/path/to/your/unpacked/dai/directory/config.toml\" \n\n 2. Specify the following configuration options in the config.toml file. ::\n\n # File System Support\n # upload : standard upload feature\n # file : local file system/server file system\n # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below\n # dtap : Blue Data Tap file system, remember to configure the DTap section below\n # s3 : Amazon S3, optionally configure secret and access key below\n # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below\n # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below\n # minio : Minio Cloud Storage, remember to configure secret and access key below\n # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)\n # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)\n # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)\n # jdbc: JDBC Connector, remember to configure JDBC below." }, { "output": " (hive_app_configs)\n # recipe_url: load custom recipe from URL\n # recipe_file: load custom recipe from local file system\n enabled_file_systems = \"file, kdb\"\n\n # kdb+ Connector credentials\n kdb_user = \"\"\n kdb_password = \"\"\n kdb_hostname = \"\n kdb_port = \"\"\n kdb_app_classpath = \"\"\n kdb_app_jvm_args = \"\"\n\n 3. Save the changes when you are done, then stop/restart Driverless AI." }, { "output": " .. figure:: ../images/add_dataset_dropdown.png\n :alt: Add Dataset\n :height: 338\n :width: 237\n\nSpecify the following information to add your dataset. 1. Enter filepath to save query. Enter the local file path for storing your dataset. For example, /home//myfile.csv. Note that this can only be a CSV file. 2. Enter KDB Query: Enter a kdb+ query that you want to execute. Note that the connector will accept any `q qeuries `__. For example: ``select from `` or `` lj ``\n\n3." }, { "output": " Data Recipe File Setup\n\n\nDriverless AI lets you explore data recipe file data sources from within the Driverless AI application. This section provides instructions for configuring Driverless AI to work with local data recipe files. When enabled (default), you will be able to modify datasets that have been added to Driverless AI. (Refer to :ref:`modify_by_recipe` for more information.) Notes:\n\n- This connector is enabled by default. These steps are provided in case this connector was previously disabled and you want to re-enable it." }, { "output": " Use ``docker version`` to check which version of Docker you are using. Enable Data Recipe File\n~\n\n.. tabs::\n .. 
   .. group-tab:: Docker Image Installs

      This example enables the data recipe file data connector.

      .. code-block:: bash
         :substitutions:

          nvidia-docker run \
              --shm-size=256m \
              --add-host name.node:172.16.2.186 \
              -e DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,upload,recipe_file" \
              -p 12345:12345 \
              --init -it --rm \
              -v /tmp/dtmp/:/tmp \
              -v /tmp/dlog/:/log \
              -v /tmp/dlicense/:/license \
              -v /tmp/ddata/:/data \
              -u $(id -u):$(id -g) \
              h2oai/dai-ubi8-x86_64:|tag|

   .. group-tab:: Docker Image with the config.toml

      This example shows how to enable the Upload Data Recipe connector in the config.toml file, and then specify that file when starting Driverless AI in Docker.

      1. Configure the Driverless AI config.toml file. Set the following configuration options:

         - ``enabled_file_systems = "file, upload, recipe_file"``

      2. Mount the config.toml file into the Docker container.

      .. code-block:: bash
         :substitutions:

          nvidia-docker run \
              --pid=host \
              --init \
              --rm \
              --shm-size=256m \
              --add-host name.node:172.16.2.186 \
              -e DRIVERLESS_AI_CONFIG_FILE=/path/in/docker/config.toml \
              -p 12345:12345 \
              -v /local/path/to/config.toml:/path/in/docker/config.toml \
              -v /etc/passwd:/etc/passwd:ro \
              -v /etc/group:/etc/group:ro \
              -v /tmp/dtmp/:/tmp \
              -v /tmp/dlog/:/log \
              -v /tmp/dlicense/:/license \
              -v /tmp/ddata/:/data \
              -u $(id -u):$(id -g) \
              h2oai/dai-ubi8-x86_64:|tag|

   .. group-tab:: Native Installs

      This example enables the Upload Data Recipe data connector.

      1. Export the Driverless AI config.toml file or add it to ~/.bashrc. For example:

         ::

            # DEB and RPM
            export DRIVERLESS_AI_CONFIG_FILE="/etc/dai/config.toml"

            # TAR SH
            export DRIVERLESS_AI_CONFIG_FILE="/path/to/your/unpacked/dai/directory/config.toml"

      2. Specify the following configuration options in the config.toml file.

         ::

            # File System Support
            # upload : standard upload feature
            # file : local file system/server file system
            # hdfs : Hadoop file system, remember to configure the HDFS config folder path and keytab below
            # dtap : Blue Data Tap file system, remember to configure the DTap section below
            # s3 : Amazon S3, optionally configure secret and access key below
            # gcs : Google Cloud Storage, remember to configure gcs_path_to_service_account_json below
            # gbq : Google Big Query, remember to configure gcs_path_to_service_account_json below
            # minio : Minio Cloud Storage, remember to configure secret and access key below
            # snow : Snowflake Data Warehouse, remember to configure Snowflake credentials below (account name, username, password)
            # kdb : KDB+ Time Series Database, remember to configure KDB credentials below (hostname and port, optionally: username, password, classpath, and jvm_args)
            # azrbs : Azure Blob Storage, remember to configure Azure credentials below (account name, account key)
            # jdbc: JDBC Connector, remember to configure JDBC below. (jdbc_app_configs)
            # hive: Hive Connector, remember to configure Hive below. (hive_app_configs)
            # recipe_url: load custom recipe from URL
            # recipe_file: load custom recipe from local file system
            enabled_file_systems = "file, upload, recipe_file"

      3. Save the changes when you are done, then stop/restart Driverless AI.
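As an alternative to editing config.toml, the same option can be supplied as an environment variable when starting Driverless AI. The following is a minimal sketch for a TAR SH install; it assumes the start script in the unpacked directory is named ``run-dai.sh``, so adjust the path and script name to match your install:

.. code-block:: bash

   # Start Driverless AI with the data recipe file connector enabled
   # (environment variables override the corresponding config.toml options)
   cd /path/to/your/unpacked/dai/directory
   export DRIVERLESS_AI_ENABLED_FILE_SYSTEMS="file,upload,recipe_file"
   ./run-dai.sh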