Spaces: retkowski/ytseg_demo


retkowski committed on Mar 22

Commit 4f73702 • 1 Parent(s): 9383286

remove non-english examples


Files changed (12)

demo_data/nips-2021/25953/metadata.json +0 -3
demo_data/nips-2021/25953/transcript_whisper_large-v2.txt +0 -193
demo_data/nips-2021/25953/transcript_whisper_large-v2.vtt +0 -581
demo_data/nips-2021/25953/video.mp4 +0 -3
demo_data/nips-2021/25962/metadata.json +0 -3
demo_data/nips-2021/25962/transcript_whisper_large-v2.txt +0 -51
demo_data/nips-2021/25962/transcript_whisper_large-v2.vtt +0 -155
demo_data/nips-2021/25962/video.mp4 +0 -3
demo_data/nips-2021/25964/metadata.json +0 -3
demo_data/nips-2021/25964/transcript_whisper_large-v2.txt +0 -366
demo_data/nips-2021/25964/transcript_whisper_large-v2.vtt +0 -1100
demo_data/nips-2021/25964/video.mp4 +0 -3

demo_data/nips-2021/25953/metadata.json DELETED

@@ -1,3 +0,0 @@
-{
-    "title": "Sliced Mutual Information: A Scalable Measure of Statistical Dependence"
-}

demo_data/nips-2021/25953/transcript_whisper_large-v2.txt DELETED

@@ -1,193 +0,0 @@
-Hi everyone, my name is Ziv Goldfeld and this is a joint work with Kristjan Greenewald about
-sliced mutual information, which is a new measure of statistical dependence that has
-some nice scalability properties to high dimensional settings.
-And to get started, I think we're all familiar with classic mutual information that is defined
-between let's say continuous high dimensional random variables, which is the regime that
-we'll mostly be interested in, like so, basically the KL divergence between their joint distribution
-and the product of their marginals.
-And mutual information is indeed this fundamental measure of dependence that enjoys many good
-properties, such as the fact that it nullifies if and only if our random variables are independent,
-it is invariant to bijections and it admits several useful representations, decompositions,
-variational forms, etc.
-And in fact, it can be even obtained axiomatically as the unique functional of the joint distribution
-that satisfies some natural informativeness conditions.
-And as such, mutual information has seen a variety of applications in information theory
-and statistics, and more recently in machine learning.
-But the problem is that all this nice structure comes with a hefty price, since computing
-mutual information in high dimensions or estimating it from samples is very, very hard, effectively
-infeasible.
-And this is the so-called curse of dimensionality and sort of the problem that we try to tackle
-in this work.
-And to address this difficulty, what we propose is sliced mutual information, which is, like
-I said, a new measure of statistical dependence, not necessarily a proxy of mutual information
-as such, but rather an alternative notion, which is defined as this average of scalar
-mutual information terms between projections of our high dimensional variables onto randomly
-chosen directions from the corresponding unit spheres.
-And it's of course inspired by the recent popularization of slicing techniques for statistical
-divergences, in particular the Wasserstein, the sliced Wasserstein distance is a great
-example.
-But the way it works for sliced mutual information is roughly so, well, let's say that this is
-our first high dimensional variable X and this is its distribution.
-What you do is draw a projection direction uniformly from the sphere.
-You then project this random variable onto that direction, do the same for your other
-random variable.
-And now for these two projected scalar new variables, we just compute the mutual information
-between them and average everything over the choice of direction.
-So that's basically the definition.
-And with that, the goal of this work is effectively to show that sliced mutual information is
-both a meaningful and a scalable mutual information alternative.
-Meaningful, well, in the sense that it preserves many of the desired properties that make mutual
-information appealing to begin with and scalable in the sense that it alleviates the set of
-computational and statistical difficulties.
-All right.
-Yeah, and to address this first point, let me show you that, well, despite those one
-dimensional projections, sliced mutual information indeed inherits many of the properties of
-classic mutual information.
-So we have, well, of course, non-negativity, but furthermore, identification of independence.
-We have an entropy decomposition for an appropriate definition of sliced entropy.
-We can represent it as a KL divergence, a sliced KL divergence.
-To be more precise, we have a chain rule tensorization for independent copies, as well as a Donsker-Varadhan-like
-variational form that can be readily used for neural estimation of sliced mutual information.
-We actually make use of that in some of our empirical results.
-And well, I mean, you are more than welcome to check the paper or visit us at the poster
-if you want to know more about any of these.
-But really, the upshot here is that much of the classic structure is still there after
-the slicing.
-Now another interesting feature of sliced mutual information comes to light when you
-think of it in the context of the famous data processing inequality.
-And for starters, recall that classic mutual information satisfies the DPI, which in particular
-means that if you process either of your random variables with a deterministic function, say
-this f over here, you can only lose the informativeness in the classic sense.
-Now sliced mutual information plays differently with processing and can in some sense benefit
-from nice transformations that, let's say, give rise to some nicer manifold for your
-random variable.
-And to understand this, keep in mind that, well, first of all, sliced mutual information
-only looks at projections of random variables.
-And it may very well be the case that some transformations of x, let's say, have more
-informative projections about y than x itself.
-And here's a simple example to that effect.
-So consider a two-dimensional isotropic Gaussian x, so two coordinates, x1 and x2.
-And let's take y to be, for example, its first coordinate.
-Now if you look at the mutual information between two fixed projections of x and y,
-well, projection does nothing to y, right, because it's a scalar.
-But it does affect x.
-And if you look at the mutual information between two projections of x and y, you quickly
-realize that x1 really plays the role of the signal here, whereas x2 behaves like noise.
-And therefore, any transformation that will effectively improve your signal-to-noise ratio,
-for example, like this g sub a over here, where a is less than 1, will indeed give rise
-to a higher sliced mutual information value.
-So all in all, sliced mutual information can be increased from processing, which means
-that, well, in particular, it violates the data processing inequality and is different
-from classic mutual information in that sense.
-But interestingly, and as I will show you shortly, this is actually a quite useful thing
-to have, for example, for feature extraction tasks, because we can use sliced mutual information
-effectively to maximize it in order to extract informative features and land on those nicer
-manifolds that I mentioned a moment ago.
-And here's an example theorem that kind of makes this statement precise or formal, where
-we consider the maximization of sliced mutual information over linear transformations of
-our random variables.
-And this would, of course, not affect classic mutual information at all.
-But what we can show is that for sliced mutual information, this maximization ends up extracting
-the two most informative projection directions for you, which in particular will be encoded
-in the optimizing matrices, these A sub x star and A sub y star.
-And of course, there's nothing special about this particular setup.
-And we can establish similar results for, well, first of all, rank-constrained matrices
-that as opposed to what's shown here would extract the, let's say, r most informative
-features or projection directions.
-In the paper, we also extend this result to shallow neural networks.
-And in fact, our argument can be easily extended to cover additional nonlinear cases as well.
-OK, so that's pretty much for structural properties.
-But like I said at the beginning, the real premise of this framework is overcoming the
-curse of dimensionality.
-And let me show you that this is indeed the case, that sliced mutual information is or
-can be estimated in a scalable manner, effectively by combining your favorite scalar mutual information
-estimator with a simple Monte Carlo average step.
-And this is how it works.
-So let's say we're given n IID samples from our high-dimensional random variables.
-And we're further given a scalar mutual information estimator that achieves, say, error delta
-of n when applied to n IID samples of some pair of one-dimensional variables, a and b.
-OK, so let's say we have these.
-Now, to estimate sliced mutual information, first thing to do is sample, let's say, m
-random projections from the corresponding spheres in an IID fashion, at which point
-we will take our high-dimensional n samples and project them onto each of these m random
-projections that we've generated.
-And the thing to observe here is that the resulting n times m data set of these projections
-is nothing but IID samples from the corresponding projected distribution, which is the right
-thing to have here if what you're trying to estimate is sliced mutual information.
-So having that, I mean, at this point, per projection direction, we can apply the scalar
-mutual information estimator and then just take one big, happy Monte Carlo average of
-the entire thing over the different projection directions.
-And this would give rise to the proposed sliced mutual information estimator.
-Now, you can compute this thing very easily, because at the end of the day, it's an average
-of scalar mutual information estimates.
-And as far as performance guarantees, we can show that so long as the per-slice mutual
-information is bounded, the uniform absolute error of this estimator scales like 1 over
-the root of m, the number of our Monte Carlo samples, plus the error of the scalar mutual
-information estimator.
-And I'm just restating this informally over here.
-And what this all in all shows is that sliced mutual information can therefore be estimated
-at the rate of the scalar mutual information estimation problem plus this m to the minus half Monte
-Carlo penalty.
-And the thing is that under appropriate smoothness assumptions, the one-dimensional rate is in
-fact parametric.
-And therefore, if you just match the size of your data set and the number of Monte Carlo
-samples, just equate n and m, the sliced mutual information between high-dimensional variables
-can be estimated at the parametric n to the minus half rate, perhaps up to some logarithmic
-factors.
-And this is, of course, a significant speed up and stands in sharp contrast to the slow,
-exponentially bad in dimension, curse of dimensionality rate for classic mutual information.
-Yeah, now this scalability makes, in fact, running empirical experiments with sliced
-mutual information quite a breeze.
-So let me quickly show you some sort of proof of concept experiments, let's say.
-And the first one just relies on the fact that, well, SMI, sliced mutual information
-can identify independence.
-And therefore, we examine it as a figure of merit for independence testing, basically
-by thresholding the computed sliced mutual information value.
-And the results that we have obtained, of course, we've compared them with the same
-test, but based on classic mutual information.
-And this figure over here shows that for a bunch of different settings, well, it presents
-the area under the ROC curve as a function of the number of samples, the standard way
-to represent the quality of an independence test.
-And you basically want this number to be 1, which corresponds to an omniscient test.
-And what we observe is that sliced mutual information performs consistently well across
-different setups and across different dimensions, whereas the performance of the mutual information,
-the classic mutual information-based test, quickly degrades as dimension grows.
-Now, on top of that, let me also demonstrate how sliced mutual information can be used
-for feature extraction.
-And here, what we want to do is maximize the sliced mutual information between linear transformations
-of x and y that are now chosen to be IID samples from the same MNIST class, which we restrict
-to be either 0 or 1.
-And the choice of class is also random, so basically just a fair coin flip.
-And by observing that sliced mutual information between x and y is at most 1 bit, I mean,
-it's always upper bounded by mutual information, which equals a single bit in this case, basically
-the class label, the way to understand what we're doing here is that we're looking for
-the linear feature that is most informative for classifying or determining this class
-label.
-And interestingly enough, this is what this procedure ends up learning, where the figure
-shows basically the first two rows of the optimal A matrix that we obtained, rearranged
-in the dimension of an MNIST image.
-And this really looks like a matched filter, if you're familiar, which, when applied to
-the samples, would indeed be able to tell you whether the sample came from the 0 class
-or not.
-And as far as for the value itself, well, the maximized sliced mutual information value
-ends up being roughly 0.7, which is quite close to the 1 bit upper bound, and is much,
-much larger than what you would get if you would not learn A, and let's say just instantiate
-it as a matrix with IID entries drawn according to some distribution.
-And this is just to say that something meaningful is indeed being learned here, and something meaningful
-indeed happens when you maximize the sliced mutual information as your optimization objective.
-OK, so yeah, that's basically it.
-And just to recap, we introduced sliced mutual information, which is this average of scalar
-mutual information terms between one-dimensional projections.
-We've seen that it preserves much of the structure of classic mutual information.
-It can be efficiently computed and estimated from samples, and can also be, in fact, increased
-by our processing if, indeed, your processing gives rise to more informative projections.
-And we've presented some proof of concept applications to independence testing, to feature
-extraction.
-We have a couple of more in the paper.
-But let me say this.
-While this is mostly theoretical work, and a large-scale empirical exploration is sort
-of beyond its scope, we firmly believe that sliced mutual information will be extremely
-useful for various such tasks, and are very excited to look into this in the future.
-And yeah, with that, I'll stop.
-Thank you guys for listening, and do visit us at the poster, and check out the paper
-if you would like to know more.
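
Annotation (not part of the deleted file): the estimation procedure described in the transcript above, m random directions per variable, a scalar mutual information estimate per projected pair, then a Monte Carlo average, can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation. The function names (`sample_sphere`, `scalar_mi_hist`, `sliced_mi`), the histogram plug-in scalar estimator, and the defaults `m=200` and `bins=16` are all assumptions chosen for illustration; the talk only says to use "your favorite scalar mutual information estimator".

```python
import numpy as np


def sample_sphere(rng, m, d):
    """Draw m directions uniformly from the unit sphere in R^d."""
    g = rng.standard_normal((m, d))
    return g / np.linalg.norm(g, axis=1, keepdims=True)


def scalar_mi_hist(a, b, bins=16):
    """Histogram plug-in estimate of I(a; b) in nats.

    Stand-in for the scalar estimator; assumed here, not taken from the talk.
    """
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()                # empirical joint pmf over the grid
    pa = p.sum(axis=1, keepdims=True)      # marginal of a, shape (bins, 1)
    pb = p.sum(axis=0, keepdims=True)      # marginal of b, shape (1, bins)
    mask = p > 0                           # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p[mask] * np.log(p[mask] / (pa @ pb)[mask])))


def sliced_mi(x, y, m=200, seed=0, scalar_mi=scalar_mi_hist):
    """Average scalar MI over m independent random projection pairs.

    x has shape (n, dx) and y has shape (n, dy); rows are paired samples.
    """
    rng = np.random.default_rng(seed)
    theta = sample_sphere(rng, m, x.shape[1])
    phi = sample_sphere(rng, m, y.shape[1])
    px = x @ theta.T                       # (n, m) projections of x
    py = y @ phi.T                         # (n, m) projections of y
    return float(np.mean([scalar_mi(px[:, j], py[:, j]) for j in range(m)]))
```

Calling `sliced_mi(x, y)` on paired sample arrays returns a single non-negative number. Because the histogram plug-in estimator is biased upward, the value for independent data is small but not exactly zero, so a threshold for independence testing would need to be calibrated, for example against shuffled samples.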

demo_data/nips-2021/25953/transcript_whisper_large-v2.vtt DELETED

@@ -1,581 +0,0 @@
-WEBVTT
-00:00.000 --> 00:13.140
-Hi everyone, my name is Ziv Goldfeld and this is a joint work with Kristjan Greenewald about
-00:13.140 --> 00:18.200
-sliced mutual information, which is a new measure of statistical dependence that has
-00:18.200 --> 00:22.520
-some nice scalability properties to high dimensional settings.
-00:22.520 --> 00:26.540
-And to get started, I think we're all familiar with classic mutual information that is defined
-00:26.540 --> 00:30.920
-between let's say continuous high dimensional random variables, which is the regime that
-00:30.920 --> 00:36.240
-we'll mostly be interested in, like so, basically the KL divergence between their joint distribution
-00:36.240 --> 00:39.040
-and the product of their marginals.
-00:39.040 --> 00:44.520
-And mutual information is indeed this fundamental measure of dependence that enjoys many good
-00:44.520 --> 00:50.060
-properties, such as the fact that it nullifies if and only if our random variables are independent,
-00:50.060 --> 00:55.200
-it is invariant to bijections and it admits several useful representations, decompositions,
-00:55.200 --> 00:56.600
-variational forms, etc.
-00:56.600 --> 01:02.440
-And in fact, it can be even obtained axiomatically as the unique functional of the joint distribution
-01:02.440 --> 01:07.760
-that satisfies some natural informativeness conditions.
-01:07.760 --> 01:11.120
-And as such, mutual information has seen a variety of applications in information theory
-01:11.120 --> 01:13.680
-and statistics, and more recently in machine learning.
-01:13.680 --> 01:18.920
-But the problem is that all this nice structure comes with a hefty price, since computing
-01:18.920 --> 01:24.520
-mutual information in high dimensions or estimating it from samples is very, very hard, effectively
-01:24.520 --> 01:25.520
-infeasible.
-01:25.520 --> 01:30.240
-And this is the so-called curse of dimensionality and sort of the problem that we try to tackle
-01:30.240 --> 01:31.400
-in this work.
-01:31.400 --> 01:37.040
-And to address this difficulty, what we propose is sliced mutual information, which is, like
-01:37.040 --> 01:42.520
-I said, a new measure of statistical dependence, not necessarily a proxy of mutual information
-01:42.520 --> 01:48.820
-as such, but rather an alternative notion, which is defined as this average of scalar
-01:48.820 --> 01:54.640
-mutual information terms between projections of our high dimensional variables onto randomly
-01:54.640 --> 01:58.520
-chosen directions from the corresponding unit spheres.
-01:58.520 --> 02:03.520
-And it's of course inspired by the recent popularization of slicing techniques for statistical
-02:03.520 --> 02:07.480
-divergences, in particular the Wasserstein, the sliced Wasserstein distance is a great
-02:07.480 --> 02:08.480
-example.
-02:08.480 --> 02:14.440
-But the way it works for sliced mutual information is roughly so, well, let's say that this is
-02:14.440 --> 02:19.120
-our first high dimensional variable X and this is its distribution.
-02:19.120 --> 02:22.480
-What you do is draw a projection direction uniformly from the sphere.
-02:22.480 --> 02:26.960
-You then project this random variable onto that direction, do the same for your other
-02:26.960 --> 02:28.200
-random variable.
-02:28.200 --> 02:34.360
-And now for these two projected scalar new variables, we just compute the mutual information
-02:34.360 --> 02:38.560
-between them and average everything over the choice of direction.
-02:38.560 --> 02:40.600
-So that's basically the definition.
-02:40.600 --> 02:45.880
-And with that, the goal of this work is effectively to show that sliced mutual information is
-02:45.880 --> 02:50.080
-both a meaningful and a scalable mutual information alternative.
-02:50.080 --> 02:56.200
-Meaningful, well, in the sense that it preserves many of the desired properties that make mutual
-02:56.200 --> 03:00.240
-information appealing to begin with and scalable in the sense that it alleviates the set of
-03:00.240 --> 03:03.800
-computational and statistical difficulties.
-03:03.800 --> 03:04.800
-All right.
-03:04.800 --> 03:11.080
-Yeah, and to address this first point, let me show you that, well, despite those one
-03:11.080 --> 03:15.800
-dimensional projections, sliced mutual information indeed inherits many of the properties of
-03:15.800 --> 03:17.700
-classic mutual information.
-03:17.700 --> 03:23.740
-So we have, well, of course, non-negativity, but furthermore, identification of independence.
-03:23.740 --> 03:28.960
-We have an entropy decomposition for an appropriate definition of sliced entropy.
-03:28.960 --> 03:31.840
-We can represent it as a KL divergence, a sliced KL divergence.
-03:31.840 --> 03:38.920
-To be more precise, we have a chain rule tensorization for independent copies, as well as a Donsker-Varadhan-like
-03:38.920 --> 03:44.840
-variational form that can be readily used for neural estimation of sliced mutual information.
-03:44.840 --> 03:49.720
-We actually make use of that in some of our empirical results.
-03:49.720 --> 03:53.400
-And well, I mean, you are more than welcome to check the paper or visit us at the poster
-03:53.400 --> 03:55.280
-if you want to know more about any of these.
-03:55.280 --> 04:00.480
-But really, the upshot here is that much of the classic structure is still there after
-04:00.480 --> 04:02.360
-the slicing.
-04:02.360 --> 04:06.240
-Now another interesting feature of sliced mutual information comes to light when you
-04:06.240 --> 04:10.400
-think of it in the context of the famous data processing inequality.
-04:10.400 --> 04:15.560
-And for starters, recall that classic mutual information satisfies the DPI, which in particular
-04:15.560 --> 04:21.440
-means that if you process either of your random variables with a deterministic function, say
-04:21.440 --> 04:27.400
-this f over here, you can only lose the informativeness in the classic sense.
-04:27.400 --> 04:33.360
-Now sliced mutual information plays differently with processing and can in some sense benefit
-04:33.360 --> 04:39.280
-from nice transformations that, let's say, give rise to some nicer manifold for your
-04:39.280 --> 04:40.280
-random variable.
-04:40.280 --> 04:43.880
-And to understand this, keep in mind that, well, first of all, sliced mutual information
-04:43.880 --> 04:47.320
-only looks at projections of random variables.
-04:47.320 --> 04:52.720
-And it may very well be the case that some transformations of x, let's say, have more
-04:52.720 --> 04:58.480
-informative projections about y than x itself.
-04:58.480 --> 05:01.080
-And here's a simple example to that effect.
-05:01.080 --> 05:06.120
-So consider a two-dimensional isotropic Gaussian x, so two coordinates, x1 and x2.
-05:06.120 --> 05:10.440
-And let's take y to be, for example, its first coordinate.
-05:10.440 --> 05:15.440
-Now if you look at the mutual information between two fixed projections of x and y,
-05:15.440 --> 05:18.600
-well, projection does nothing to y, right, because it's a scalar.
-05:18.600 --> 05:20.400
-But it does affect x.
-05:20.400 --> 05:24.520
-And if you look at the mutual information between two projections of x and y, you quickly
-05:24.520 --> 05:31.120
-realize that x1 really plays the role of the signal here, whereas x2 behaves like noise.
-05:31.120 --> 05:36.120
-And therefore, any transformation that will effectively improve your signal-to-noise ratio,
-05:36.120 --> 05:42.520
-for example, like this g sub a over here, where a is less than 1, will indeed give rise
-05:42.520 --> 05:45.880
-to a higher sliced mutual information value.
-05:45.880 --> 05:50.300
-So all in all, sliced mutual information can be increased from processing, which means
-05:50.300 --> 05:54.440
-that, well, in particular, it violates the data processing inequality and is different
-05:54.440 --> 05:56.840
-from classic mutual information in that sense.
-05:56.840 --> 06:03.120
-But interestingly, and as I will show you shortly, this is actually a quite useful thing
-06:03.120 --> 06:08.400
-to have, for example, for feature extraction tasks, because we can use sliced mutual information
-06:08.400 --> 06:14.240
-effectively to maximize it in order to extract informative features and land on those nicer
-06:14.240 --> 06:17.660
-manifolds that I mentioned a moment ago.
-06:17.660 --> 06:22.280
-And here's an example theorem that kind of makes this statement precise or formal, where
-06:22.280 --> 06:28.120
-we consider the maximization of sliced mutual information over linear transformations of
-06:28.120 --> 06:29.920
-our random variables.
-06:29.920 --> 06:34.200
-And this would, of course, not affect classic mutual information at all.
-06:34.200 --> 06:39.160
-But what we can show is that for sliced mutual information, this maximization ends up extracting
-06:39.160 --> 06:44.960
-the two most informative projection directions for you, which in particular will be encoded
-06:44.960 --> 06:52.200
-in the optimizing matrices, these A sub x star and A sub y star.
-06:52.200 --> 06:55.240
-And of course, there's nothing special about this particular setup.
-06:55.240 --> 07:00.720
-And we can establish similar results for, well, first of all, rank-constrained matrices
-07:00.720 --> 07:06.720
-that as opposed to what's shown here would extract the, let's say, r most informative
-07:06.720 --> 07:08.840
-features or projection directions.
-07:08.840 --> 07:11.120
-In the paper, we also extend this result to shallow neural networks.
-07:11.120 --> 07:17.840
-And in fact, our argument can be easily extended to cover additional nonlinear cases as well.
-07:17.840 --> 07:21.440
-OK, so that's pretty much for structural properties.
-07:21.440 --> 07:25.400
-But like I said at the beginning, the real premise of this framework is overcoming the
-07:25.400 --> 07:26.400
-curse of dimensionality.
-07:26.400 --> 07:32.640
-And let me show you that this is indeed the case, that sliced mutual information is or
-07:32.640 --> 07:38.640
-can be estimated in a scalable manner, effectively by combining your favorite scalar mutual information
-07:38.640 --> 07:42.200
-estimator with a simple Monte Carlo average step.
-07:42.200 --> 07:43.480
-And this is how it works.
-07:43.480 --> 07:48.260
-So let's say we're given n IID samples from our high-dimensional random variables.
-07:48.260 --> 07:53.400
-And we're further given a scalar mutual information estimator that achieves, say, error delta
-07:53.400 --> 08:00.240
-of n when applied to n IID samples of some pair of one-dimensional variables, a and b.
-08:00.240 --> 08:02.040
-OK, so let's say we have these.
-08:02.040 --> 08:08.760
-Now, to estimate sliced mutual information, first thing to do is sample, let's say, m
-08:08.760 --> 08:14.680
-random projections from the corresponding spheres in an IID fashion, at which point
-08:14.680 --> 08:22.400
-we will take our high-dimensional n samples and project them onto each of these m random
-08:22.400 --> 08:24.960
-projections that we've generated.
-08:24.960 --> 08:30.780
-And the thing to observe here is that the resulting n times m data set of these projections
-08:30.780 --> 08:35.220
-is nothing but IID samples from the corresponding projected distribution, which is the right
-08:35.220 --> 08:39.400
-thing to have here if what you're trying to estimate is sliced mutual information.
-08:39.400 --> 08:43.860
-So having that, I mean, at this point, per projection direction, we can apply the scalar
-08:43.860 --> 08:49.400
-mutual information estimator and then just take one big, happy Monte Carlo average of
-08:49.400 --> 08:52.040
-the entire thing over the different projection directions.
-08:52.040 --> 08:55.600
-And this would give rise to the proposed sliced mutual information estimator.
-08:55.600 --> 08:59.780
-Now, you can compute this thing very easily, because at the end of the day, it's an average
-08:59.780 --> 09:03.000
-of scalar mutual information estimates.
-09:03.000 --> 09:09.120
-And as far as performance guarantees, we can show that so long as the per-slice mutual
-09:09.120 --> 09:15.840
-information is bounded, the uniform absolute error of this estimator scales like 1 over
-09:15.840 --> 09:22.240
-the root of m, the number of our Monte Carlo samples, plus the error of the scalar mutual
-09:22.240 --> 09:23.240
-information estimator.
-09:23.240 --> 09:26.520
-And I'm just restating this informally over here.
-09:26.520 --> 09:31.240
-And what this all in all shows is that sliced mutual information can therefore be estimated
-09:31.240 --> 09:37.760
-at the rate of the scalar mutual information estimation problem plus this m to the minus half Monte
-09:37.760 --> 09:38.760
-Carlo penalty.
-09:38.760 --> 09:43.440
-And the thing is that under appropriate smoothness assumptions, the one-dimensional rate is in
-09:43.440 --> 09:45.200
-fact parametric.
-09:45.200 --> 09:49.720
-And therefore, if you just match the size of your data set and the number of Monte Carlo
-09:49.720 --> 09:54.640
-samples, just equate n and m, the sliced mutual information between high-dimensional variables
-09:54.640 --> 09:59.360
-can be estimated at the parametric n to the minus half rate, perhaps up to some logarithmic
-09:59.360 --> 10:00.360
-factors.
-10:00.360 --> 10:06.360
-And this is, of course, a significant speed up and stands in sharp contrast to the slow,
-10:06.360 --> 10:12.040
-exponentially bad in dimension, curse of dimensionality rate for classic mutual information.
-10:12.040 --> 10:17.200
-Yeah, now this scalability makes, in fact, running empirical experiments with sliced
-10:17.200 --> 10:18.720
-mutual information quite a breeze.
-10:18.720 --> 10:24.160
-So let me quickly show you some sort of proof of concept experiments, let's say.
-10:24.160 --> 10:28.280
-And the first one just relies on the fact that, well, SMI, sliced mutual information
-10:28.280 --> 10:29.840
-can identify independence.
-10:29.840 --> 10:34.440
-And therefore, we examine it as a figure of merit for independence testing, basically
-10:34.440 --> 10:38.640
-by thresholding the computed sliced mutual information value.
-10:38.640 --> 10:42.000
-And the results that we have obtained, of course, we've compared them with the same
-10:42.000 --> 10:45.360
-test, but based on classic mutual information.
-10:45.360 --> 10:50.320
-And this figure over here shows that for a bunch of different settings, well, it presents
-10:50.320 --> 10:55.040
-the area under the ROC curve as a function of the number of samples, the standard way
-10:55.040 --> 10:59.160
-to represent the quality of an independence test.
-10:59.160 --> 11:02.920
-And you basically want this number to be 1, which corresponds to an omniscient test.
-11:02.920 --> 11:07.520
-And what we observe is that sliced mutual information performs consistently well across
-11:07.520 --> 11:13.080
-different setups and across different dimensions, whereas the performance of the mutual information,
-11:13.080 --> 11:18.280
-the classic mutual information-based test, quickly degrades as dimension grows.
-11:18.280 --> 11:23.280
-Now, on top of that, let me also demonstrate how sliced mutual information can be used
-11:23.280 --> 11:24.680
-for feature extraction.
-11:24.680 --> 11:29.780
-And here, what we want to do is maximize the sliced mutual information between linear transformations
-11:29.780 --> 11:37.160
-of x and y that are now chosen to be IID samples from the same MNIST class, which we restrict
-11:37.160 --> 11:39.240
-to be either 0 or 1.
-11:39.240 --> 11:42.840
-And the choice of class is also random, so basically just a fair coin flip.
-11:42.840 --> 11:47.280
-And by observing that sliced mutual information between x and y is at most 1 bit, I mean,
-11:47.280 --> 11:52.560
-it's always upper bounded by mutual information, which equals a single bit in this case, basically
-11:52.560 --> 11:57.320
-the class label, the way to understand what we're doing here is that we're looking for
-11:57.320 --> 12:03.400
-the linear feature that is most informative for classifying or determining this class
-12:03.400 --> 12:04.760
-label.
-12:04.760 --> 12:08.200
-And interestingly enough, this is what this procedure ends up learning, where the figure
-12:08.200 --> 12:15.040
-shows basically the first two rows of the optimal A matrix that we obtained, rearranged
-12:15.040 --> 12:17.480
-in the dimension of an MNIST image.
-12:17.480 --> 12:22.720
-And this really looks like a matched filter, if you're familiar, which, when applied to
-12:22.720 --> 12:27.480
-the samples, would indeed be able to tell you whether the sample came from the 0 class
-12:27.480 --> 12:28.640
-or not.
-12:28.640 --> 12:33.680
-And as far as for the value itself, well, the maximized sliced mutual information value
-12:33.680 --> 12:39.800
-ends up being roughly 0.7, which is quite close to the 1 bit upper bound, and is much,
-12:39.800 --> 12:44.400
-much larger than what you would get if you would not learn A, and let's say just instantiate
-12:44.400 --> 12:49.480
-it as a matrix with IID entries drawn according to some distribution.
-12:49.480 --> 12:53.640
-And this is just to say that something meaningful indeed being learned here, and something meaningful
-12:53.640 --> 13:00.160
-indeed happens when you maximize the sliced mutual information as your optimization objective.
-13:00.160 --> 13:03.400
-OK, so yeah, that's basically it.
-13:03.400 --> 13:09.160
-And just to recap, we introduced sliced mutual information, which is this average of scalar
-13:09.160 --> 13:12.160
-mutual information terms between one-dimensional projections.
-13:12.160 --> 13:15.880
-We've seen that it preserves much of the structure of classic mutual information.
-13:15.880 --> 13:22.280
-It can be efficiently computed and estimated from samples, and can also be, in fact, increased
-13:22.280 --> 13:28.280
-by our processing if, indeed, your processing gives rise to more informative projections.
-13:28.280 --> 13:32.960
-And we've presented some proof of concept applications to independence testing, to feature
-13:32.960 --> 13:33.960
-extraction.
-13:33.960 --> 13:35.800
-We have a couple of more in the paper.
-13:35.800 --> 13:36.960
-But let me say this.
-13:36.960 --> 13:41.480
-While this is mostly theoretical work, and a large-scale empirical exploration is sort
-13:41.480 --> 13:46.640
-of beyond its scope, we firmly believe that sliced mutual information will be extremely
-13:46.640 --> 13:51.360
-useful for various such tasks, and are very excited to look into this in the future.
-13:51.360 --> 13:52.680
-And yeah, with that, I'll stop.
-13:52.680 --> 13:57.220
-Thank you guys for listening, and do visit us at the poster, and check out the paper
-13:57.220 --> 14:12.560
-if you would like to know more.

demo_data/nips-2021/25953/video.mp4 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:06f4968133dc8ada5fd9bf717fcd61a91049cd3c3034553cb6c2490f292c8a42
-size 90905227

demo_data/nips-2021/25962/metadata.json DELETED

@@ -1,3 +0,0 @@
-{
-    "title": "Locally differentially private estimation of functionals of discrete distributions"
-}

demo_data/nips-2021/25962/transcript_whisper_large-v2.txt DELETED

@@ -1,51 +0,0 @@
-Bonjour à tous, je suis Yannis Hartel et je vais vous présenter un travail sur l'estimation
-de fonctionnalité en termes de certaines contraintes particulières de la privacité.
-C'est un travail en lien avec mon conseiller postdoc, le professeur Cristina Gutucia.
-Nous sommes intéressés par le fonctionnalité de la somme de puissance, qui est la somme de probabilités associées
-à une distribution discrète, à la puissance gamma, où gamma est un nombre réel positif.
-Donc, ce fonctionnalité de la somme de puissance est un exemple d'information qui se déroule dans différents domaines
-comme les statistiques, l'apprentissage de machines, la théorie de l'information, la science de la neurone, etc.
-Voici donc le problème statistique standard, où l'objectif est d'estimer la somme de puissance fonctionnelle
-basée sur des exemples NIID, X1, X2 jusqu'à XN, qui suivent une distribution discrète B avec une taille d'alphabet K.
-Une approche beaucoup utilisée est le estimateur de plug-in, où l'on utilise un estimateur du paramètre P
-pour construire un estimateur du fonctionnalité, à travers le principe de plug-in.
-Cette approche n'est pas seulement simple et intuitive, mais elle est aussi théoriquement saine,
-car elle satisfait une efficacité asymptotique et une néro-optimalité non-asymptote.
-La question intéressante de notre paper est de savoir si cette approche de plug-in
-fonctionne dans un état de séparation non standard, où l'on impose une contrainte de privé,
-et plus précisément, le setup de la privé différente local.
-Ce qui signifie que l'on impose un état de privé fort, où l'on n'a pas accès aux données initiales et sensibles, les XI.
-Au lieu de ça, l'on a seulement accès à une version privée de XI.
-Voici la représentation d'un mécanisme simple qui n'est pas interactif.
-Les termes local ici reflètent le fait que le mécanisme QI ne voit que les données XI.
-En d'autres mots, il n'y a pas de troisième parti confiant qui a accès à toutes les données sensibles.
-C'est un mécanisme de privé non-interactif simple, mais bien sûr, nous sommes aussi intéressés par des mécanismes plus sophistiqués,
-notamment le mécanisme de séquence interactif, où chaque QI voit les données privées dévoilées précédemment,
-et les données privées de XI, et les données privées de XI.
-Dans cette étude non-standard, nous retournons au problème original de l'estimation fonctionnelle de la power sum,
-où nous n'avons qu'accès à des données privées de XI jusqu'à XL.
-Notre première contribution est de donner une caractérisation tigrée et non-transomatique du erreur de caractérisation de la power sum de l'estimateur.
-Ce résultat montre que l'estimateur de la power sum n'est pas optimal.
-Cela contraste avec la performance de l'estimateur de la power sum dans le problème statistique standard.
-Le message ici est que les bons estimateurs dans le setup standard ne sont pas toujours bons estimateurs dans le setup local privacy.
-Notre deuxième contribution est la correction du estimateur de plug-in grâce à une attentionnée de troncation de Pk de petites probabilités.
-Cette correction conduit à une réduction significative du risque d'erreur.
-En particulier, le risque devient indépendant du size alphabétique K lorsque K est grand.
-Cette deuxième contribution, par contre, se base sur un mécanisme de privé non-interactif simple.
-Dans la seconde partie du document, nous examinons un mécanisme de séquence interactive plus sophistiqué,
-pour lequel nous construisons une procédure de deux pas qui nous permet de réduire le risque grâce à un facteur logarithmique.
-Enfin, à la fin du document, nous fournissons un lien universel en bas sur le risque d'erreur
-avec respect à tous les estimateurs et tous les mécanismes non-interactifs et séquentially interactifs.
-Malheureusement, ce lien bas est un lien d'accords uniquement dans certains cas,
-ce qui nous laisse avec quelques questions très importantes à poser sur ce problème.
-Je pense que ce premier travail sur l'estimation fonctionnelle dans le contexte de la privé locale
-vous donne au moins trois points clés.
-Le premier point clé est le besoin de construire une procédure statistique prudente pour la configuration de la privé locale,
-puisque c'est un setup où un bon estimateur dans un cadre standard n'a pas nécessairement de fonction.
-Le deuxième point clé est que l'approche de type de plug-in analysée dans ce document
-sert comme un benchmark pour de futurs travaux et des procédures plus sophistiquées.
-Et le dernier point clé est que notre analyse de l'approche de type de plug-in et des mécanismes non-interactifs
-montrent des régimes où le problème d'estimation est difficile
-et espérons que cela incite les gens à amener des développements ici.
-Merci à tous, et pour plus de détails, veuillez vérifier notre document en ligne.
-Bye!

demo_data/nips-2021/25962/transcript_whisper_large-v2.vtt DELETED

@@ -1,155 +0,0 @@
-WEBVTT
-00:00.000 --> 00:14.000
-Bonjour à tous, je suis Yannis Hartel et je vais vous présenter un travail sur l'estimation
-00:14.000 --> 00:18.000
-de fonctionnalité en termes de certaines contraintes particulières de la privacité.
-00:18.000 --> 00:24.000
-C'est un travail en lien avec mon conseiller postdoc, le professeur Cristina Gutucia.
-00:24.000 --> 00:30.000
-Nous sommes intéressés par le fonctionnalité de la somme de puissance, qui est la somme de probabilités associées
-00:30.000 --> 00:37.000
-à une distribution discrète, à la puissance gamma, où gamma est un nombre réel positif.
-00:37.000 --> 00:46.000
-Donc, ce fonctionnalité de la somme de puissance est un exemple d'information qui se déroule dans différents domaines
-00:46.000 --> 00:54.000
-comme les statistiques, l'apprentissage de machines, la théorie de l'information, la science de la neurone, etc.
-00:54.000 --> 01:00.000
-Voici donc le problème statistique standard, où l'objectif est d'estimer la somme de puissance fonctionnelle
-01:00.000 --> 01:10.000
-basée sur des exemples NIID, X1, X2 jusqu'à XN, qui suivent une distribution discrète B avec une taille d'alphabet K.
-01:10.000 --> 01:19.000
-Une approche beaucoup utilisée est le estimateur de plug-in, où l'on utilise un estimateur du paramètre P
-01:19.000 --> 01:25.000
-pour construire un estimateur du fonctionnalité, à travers le principe de plug-in.
-01:25.000 --> 01:32.000
-Cette approche n'est pas seulement simple et intuitive, mais elle est aussi théoriquement saine,
-01:32.000 --> 01:38.000
-car elle satisfait une efficacité asymptotique et une néro-optimalité non-asymptote.
-01:38.000 --> 01:45.000
-La question intéressante de notre paper est de savoir si cette approche de plug-in
-01:45.000 --> 01:50.000
-fonctionne dans un état de séparation non standard, où l'on impose une contrainte de privé,
-01:50.000 --> 01:55.000
-et plus précisément, le setup de la privé différente local.
-01:55.000 --> 02:06.000
-Ce qui signifie que l'on impose un état de privé fort, où l'on n'a pas accès aux données initiales et sensibles, les XI.
-02:06.000 --> 02:12.000
-Au lieu de ça, l'on a seulement accès à une version privée de XI.
-02:12.000 --> 02:22.000
-Voici la représentation d'un mécanisme simple qui n'est pas interactif.
-02:22.000 --> 02:30.000
-Les termes local ici reflètent le fait que le mécanisme QI ne voit que les données XI.
-02:30.000 --> 02:38.000
-En d'autres mots, il n'y a pas de troisième parti confiant qui a accès à toutes les données sensibles.
-02:38.000 --> 02:48.000
-C'est un mécanisme de privé non-interactif simple, mais bien sûr, nous sommes aussi intéressés par des mécanismes plus sophistiqués,
-02:48.000 --> 02:55.000
-notamment le mécanisme de séquence interactif, où chaque QI voit les données privées dévoilées précédemment,
-02:55.000 --> 03:00.000
-et les données privées de XI, et les données privées de XI.
-03:00.000 --> 03:10.000
-Dans cette étude non-standard, nous retournons au problème original de l'estimation fonctionnelle de la power sum,
-03:10.000 --> 03:15.000
-où nous n'avons qu'accès à des données privées de XI jusqu'à XL.
-03:15.000 --> 03:26.000
-Notre première contribution est de donner une caractérisation tigrée et non-transomatique du erreur de caractérisation de la power sum de l'estimateur.
-03:26.000 --> 03:33.000
-Ce résultat montre que l'estimateur de la power sum n'est pas optimal.
-03:33.000 --> 03:41.000
-Cela contraste avec la performance de l'estimateur de la power sum dans le problème statistique standard.
-03:41.000 --> 03:50.000
-Le message ici est que les bons estimateurs dans le setup standard ne sont pas toujours bons estimateurs dans le setup local privacy.
-03:50.000 --> 04:00.000
-Notre deuxième contribution est la correction du estimateur de plug-in grâce à une attentionnée de troncation de Pk de petites probabilités.
-04:00.000 --> 04:06.000
-Cette correction conduit à une réduction significative du risque d'erreur.
-04:06.000 --> 04:13.000
-En particulier, le risque devient indépendant du size alphabétique K lorsque K est grand.
-04:13.000 --> 04:22.000
-Cette deuxième contribution, par contre, se base sur un mécanisme de privé non-interactif simple.
-04:22.000 --> 04:29.000
-Dans la seconde partie du document, nous examinons un mécanisme de séquence interactive plus sophistiqué,
-04:29.000 --> 04:40.000
-pour lequel nous construisons une procédure de deux pas qui nous permet de réduire le risque grâce à un facteur logarithmique.
-04:40.000 --> 04:45.000
-Enfin, à la fin du document, nous fournissons un lien universel en bas sur le risque d'erreur
-04:45.000 --> 04:51.000
-avec respect à tous les estimateurs et tous les mécanismes non-interactifs et séquentially interactifs.
-04:51.000 --> 04:56.000
-Malheureusement, ce lien bas est un lien d'accords uniquement dans certains cas,
-04:56.000 --> 05:02.000
-ce qui nous laisse avec quelques questions très importantes à poser sur ce problème.
-05:02.000 --> 05:10.000
-Je pense que ce premier travail sur l'estimation fonctionnelle dans le contexte de la privé locale
-05:10.000 --> 05:14.000
-vous donne au moins trois points clés.
-05:14.000 --> 05:23.000
-Le premier point clé est le besoin de construire une procédure statistique prudente pour la configuration de la privé locale,
-05:23.000 --> 05:31.000
-puisque c'est un setup où un bon estimateur dans un cadre standard n'a pas nécessairement de fonction.
-05:31.000 --> 05:38.000
-Le deuxième point clé est que l'approche de type de plug-in analysée dans ce document
-05:38.000 --> 05:43.000
-sert comme un benchmark pour de futurs travaux et des procédures plus sophistiquées.
-05:43.000 --> 05:51.000
-Et le dernier point clé est que notre analyse de l'approche de type de plug-in et des mécanismes non-interactifs
-05:51.000 --> 05:56.000
-montrent des régimes où le problème d'estimation est difficile
-05:56.000 --> 06:01.000
-et espérons que cela incite les gens à amener des développements ici.
-06:01.000 --> 06:08.000
-Merci à tous, et pour plus de détails, veuillez vérifier notre document en ligne.
-06:08.000 --> 06:22.000
-Bye!

demo_data/nips-2021/25962/video.mp4 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:922f83c4e8f367bc0341f90d1b55d4e3bafe1296c7cc679dc8057a844f5c38ef
-size 40157100

demo_data/nips-2021/25964/metadata.json DELETED

@@ -1,3 +0,0 @@
-{
-    "title": "Reinforcement Learning in Linear MDPs: Constant Regret and Representation Selection"
-}

demo_data/nips-2021/25964/transcript_whisper_large-v2.txt DELETED

@@ -1,366 +0,0 @@
-e la possibilità di eseguire un'operazione di modello di un'algebra.
-Questo è un'operazione che è stata creata per il nostro studio,
-e che è stato creato per il nostro studio.
-Ciao a tutti, sono Matteo Papini,
-e questo è un lavoro insieme con Andrea Tirinzoni,
-Aldo Pacchiano, Marcello Restelli,
-Alessandro Lazzarici e Matteo Pirotta.
-Il nostro lavoro è motivato dall'efficacia
-di algoritmi di imparazione di rinforzamento profondo
-per risolvere tasche complesse, come i videoghi.
-Una caratteristica fondamentale di questi metodi
-è la possibilità di eseguire neural networks
-per eseguire rappresentazioni complesse delle tasche
-che permette di rappresentare e imparare
-le polizie ottime efficacemente.
-Capire cosa fa una rappresentazione buona
-e come trovarne una
-è fondamentale per disegnare
-migliori algoritmi di imparazione di rinforzamento.
-In questo lavoro, per prima volta,
-ci sono state presentate caratterizzazioni formali
-di rappresentazioni buone per l'imparazione di rinforzamento.
-Abbiamo mostrato che usare una rappresentazione buona
-può davvero beneficiare l'efficienza di imparazione
-e fornire garantie di regretto costante.
-Finalmente, abbiamo mostrato come una rappresentazione buona
-può essere selezionata dall'interazione online,
-un primo passaggio verso l'apprendimento di rappresentazione per RL.
-Ma prima di tutto, qualche background.
-Il problema di imparazione è modellato
-come un processo di decisione di marco finito di orizzonte, o MDP.
-In ogni passaggio di tempo, l'agente osserva un stato dell'ambiente,
-prende un'azione e riceve una rinforza
-e un stato successivo come risultato.
-Questi sono determinati rispettivamente
-da una funzione di rinforza e una funzione di transizione
-che sono un'unità di tempo e un'unità di non-conoscenza.
-L'interazione è dividita in due episodi
-di lunghezza finita, che si chiama l'orizzonte.
-All'ultimo episodio, il stato è risalto
-a seconda della distribuzione fissata.
-Il comportamento dell'agente è modellato da una polizia,
-che è una mappatura da stati all'azione
-che può anche essere dipendente del tempo.
-La funzione di valore, o funzione Q della polizia Pi,
-dà la rinforza aspettata totale
-ottenuta prendendo l'azione A in stato S a tempo H
-e poi seguendo la polizia fino all'ultimo episodio.
-Un'ottima polizia è garantita
-che la funzione Q si massima su tutti i stati.
-Facciamo un'assumzione extra
-che ogni stato admette un'azione ottima unica.
-Quando il numero di stati è molto grande o anche infinito,
-imparare l'ottima polizia può essere molto difficile.
-Quindi guardiamo i linear MDPs
-dove l'agente ha accesso a una rappresentazione compatta.
-Questa è una mappatura di caratteristiche
-da stati e azioni a vectori d-dimensional
-dove D è più piccolo.
-Potete vedere queste caratteristiche
-come l'ultimo strato scoperto di un'intera rete neurale.
-Nell'apprendimento di rinforzamento profondo
-impariamo tutti i pesi della rete simultaneamente.
-Qui mantendremo la rappresentazione fissa
-e impareremo solo i finali parametri
-che sono i pesi di una combinazione lineare.
-Questa funzione lineare, almeno,
-deve essere in grado di rappresentare la funzione Q ottima
-in modo da poterla usare per prendere azioni ottime.
-Ma, infine,
-essere in grado di rappresentare la funzione Q ottima
-non è abbastanza per l'apprendimento efficace
-perché un numero esponenziale di esempi
-può ancora essere richiesto.
-Per evitare questo,
-ci sono necessità di assumizioni strutturali extra
-sull'MDP,
-e alcune sono state proposte nella literatura.
-Nel MDP di basso rango,
-sia la funzione di rinforzamento che la funzione di transizione
-sono lineari nelle stesse funzioni.
-Queste funzioni possono essere tempo-indipendenti.
-Assumiamo solo per semplicità
-che le due funzioni condividono la stessa dimensione D.
-Una prima conseguenza della struttura di basso rango
-è che la funzione Q di ogni polizia
-può essere rappresentata come una funzione lineare delle funzioni.
-Una assumzione strutturale più forte è la rinforzamento di Bellman.
-In questi MDP,
-tutte le funzioni lineare delle funzioni
-devono essere chiuse sotto l'operatore di optimità di Bellman.
-La struttura di basso rango implica la chiusura di Bellman,
-ma l'opposto non è vero.
-Indeed, nelle MDP di chiusura di Bellman,
-solo l'ottima funzione Q
-è garantita di essere realizzabile lineariamente.
-Le algoritmi di imparazione di rinforzamento efficace
-sono state proposte per questi settimenti.
-Possiamo evaluare le funzioni
-usando il concetto di risalto,
-che è l'amounto totale di sub-optimità
-che viene sofferto dall'agente
-durante il processo di imparazione
-rispetto alla polizia ottima.
-Nelle MDP di basso rango,
-l'algoritmo LSVI-UCB
-soffre solo un regalo sublineare
-nel caso più grave.
-Eleanor è una versione raffinata
-che funziona nel caso più generale
-della chiusura di Bellman
-e ha una migliore dipendenza
-sulla dimensione di caratteristiche.
-Doveva essere notato, però,
-che Eleanor è computazionale intrattabile.
-Per il LSVI-UCB
-abbiamo anche un regalo
-di base di istanze
-che è logaritmico
-nel numero totale di interazioni.
-Qui Delta denuncia
-il capo di sub-optimità
-di una pariera di attesa statale
-che è assumato di avere
-un minimo ben definito.
-Tutti questi regali di base
-ignorano la qualità della rappresentazione,
-a parte le assumazioni strutturali
-che sono necessarie
-per la sua gestione.
-La domanda che cercheremo di rispondere è questa.
-Possiamo raggiungere
-anche piccoli dolori
-con una buona rappresentazione?
-Per rendere questo concetto
-di buona rappresentazione formale
-introduciamo la proprietà Unisoft.
-Una rappresentazione è Unisoft
-se le caratteristiche ottime
-spostano l'intero spazio di caratteristiche.
-Le caratteristiche ottime sono
-le caratteristiche delle azioni ottime
-in stati che sono raggiuntibili
-alla propria politica ottimale.
-Intuitivamente, la proprietà Unisoft
-garantisce che le caratteristiche ottime
-sono diverse abbastanza
-per che l'agente
-cominci rapidamente alla politica ottimale
-senza ridurre
-l'amounto di informazioni che riceve
-sulla tasca in generale.
-Possiamo anche misurare
-il grado di diversità della rappresentazione
-guardando i più piccoli valori
-degli eigenvali
-della matrica di covarianza delle caratteristiche ottime.
-Questo parametro di Lambda
-porterà un ruolo importante
-nelle nostre regrette.
-Notate che un valore più alto di Lambda
-è migliore perché denota
-più diversità di caratteristiche
-e che Lambda può essere al massimo
-una sotto assumizioni comuni
-sulla magnitude di caratteristiche.
-Ma in quale senso sono queste rappresentazioni
-ottime?
-Ciò che abbiamo mostrato in MDP lineari
-è che Unisoft è sinonimo
-con regrette costanti.
-Per prima cosa, abbiamo mostrato
-che la proprietà di Unisoft
-è necessaria per raggiungere
-regrette costanti in MDP
-con regretti lineari.
-Questo appartiene a MDPs di basso rango,
-Bellman closure,
-e anche a MDPs di mixtura lineare
-che sono un'altra
-assumazione strutturale comune.
-Ma Unisoft è anche sufficiente
-per regrette costanti
-in casi interessanti.
-In MDPs di basso rango,
-SVI-UCB raggiunge
-regrette costanti se e solo se
-la rappresentazione è Unisoft.
-Con una alta probabilità,
-un numero finito
-di interaczioni è sufficiente
-per l'agente imparare
-perfettamente la polizia ottimale.
-Quindi, la regrette può essere
-rilassata in termini di questo tempo costante
-regardless of the
-total number of episodes k.
-In altri parole, la regrette
-è costante.
-Notate come il tempo τ
-dipende inversamente
-sul parametro λ.
-Indeed, con una mappa di
-più diversità di caratteristiche, possiamo imparare
-la polizia ottimale più velocemente.
-Abbiamo un risultato simile
-per Eleanor nel caso più generale
-di MDPs di Bellman closure,
-con anche una migliore
-dipendenza sulla dimensione d
-della caratteristica.
-Infine, la mancanza di
-lombari per Eleanor
-dà questa polinomiale
-dipendenza sul parametro λ
-rispetto a una dipendenza logaritmica
-nel caso di LSVI-UCB.
-Ma questo potrebbe ben essere
-un artefatto del nostro provo.
-Per ricapitulare, abbiamo mostrato
-che l'Unisoft è
-sia necessario che sufficiente
-per raggiungere regrette costanti
-in MDPs di Bellman closure
-e di low rank, e ha
-provvinto regrette costanti
-per i bounds superiori per algoritmi comuni.
-Nella ultima parte del
-talco, mostriamo come
-le representazioni buone possono essere
-scelte online.
-Ci concentriamo su MDPs di low rank
-per semplicità.
-L'agente è dato un set
-di N rappresentazioni candidate
-che rappresentano
-la stessa MDP di low rank
-senza misspecificazione.
-Le rappresentazioni possono avere
-diverse dimensioni.
-Questo differe dall'approccio tipico
-di rappresentazione di lezione in RL
-dove si cercano di trovare
-una rappresentazione accurata
-da una classe di funzioni realizzabili.
-Questo permette di
-risolvere le misspecificazioni, ma
-è tipicamente fatto offline.
-Il nostro obiettivo è
-imparare così efficientemente
-come se usassimo la migliore
-rappresentazione candidata nel set
-senza sapere in avanzo.
-Ovviamente, se una delle candidate
-è Unisoft, vorremmo
-ottenere un regalo costante.
-L'algoritmo che proponiamo
-è LSVI Leader.
-Si guida
-N istanze parallele di LSVI UCB,
-una per ogni rappresentazione
-candidata.
-Per ogni rappresentazione, usiamo
-tutte le date collezionate
-dall'agente per esimerare
-il parametro dell'ottima
-funzione Q accordo
-a questa rappresentazione.
-Questo è fatto con una combinazione
-di square e induzione sbattuta.
-Un bonus di esplorazione
-viene aggiunto all'estimato
-del parametro per rendere
-l'estimato ottimista, come nel caso di LSVI UCB.
-Ma ora
-abbiamo un parametro ottimista
-per ogni rappresentazione
-e l'azione viene scelta
-per maximizzare il più piccolo
-parametro ottimista,
-che è anche l'estimato più tico.
-Notate come questo
-è in realtà più potente
-dell'algoritmo di selezione del modello
-perché possiamo usare
-una rappresentazione diversa
-per ogni stato.
-Vediamo che il regalo del leader di LSVI
-è superiore
-a quello di LSVI UCB
-se è condannato con la rappresentazione
-migliore dei candidati,
-a meno di un fattore,
-che è il numero di candidati
-in square.
-Questo significa che se abbiamo
-una rappresentazione di Unisoft nel set,
-il leader di LSVI
-raggiunge il regalo di selezione.
-Ma il leader di LSVI
-può combinare rappresentazioni
-attraverso stagi, stati e azioni,
-e quindi
-a volte può raggiungere
-il regalo di selezione
-anche se non c'è una rappresentazione di candidati
-di Unisoft.
-I nostri risultati teoretici sono anche supportati
-dai risultati empirici
-in MDPs di piccolo regalo di selezione.
-Questi plotti mostrano il regalo di selezione
-come funzione del numero di episodi.
-A sinistra abbiamo
-il regalo di LSVI-UCB
-che è gestito con
-diverse rappresentazioni.
-Di queste, l'unica rappresentazione
-in grigio nel plotto
-è Unisoft, e solo in questo caso
-LSVI-UCB è in grado
-di raggiungere regali costanti.
-A sinistra abbiamo il regalo
-del leader di LSVI
-che è gestito con vari set di candidati.
-In tutti questi casi,
-il leader di LSVI raggiunge
-regali costanti.
-Ovviamente, senza sapere
-la migliore rappresentazione in avanzo,
-ci serve più tempo per imparare la polizia ottima,
-ma questo è stato anche aspettato
-dalla nostra regola di selezione.
-Il plotto arancione è particolarmente
-interessante, perché in questo caso
-l'unica rappresentazione di Unisoft,
-numero 1,
-non è nel set di candidati,
-ma ancora LSVI-leader è in grado
-di raggiungere regali costanti
-combinando le representazioni rimaste.
-Nel lavoro futuro,
-vorremmo migliorare questo fattore
-di sqvrtn nel regalo del leader di LSVI,
-perché nel caso dei banditi lineari
-la dipendenza sull'umare
-delle rappresentazioni è solo logaritmica.
-Vorremmo anche
-estendere il leader di LSVI
-per gestire le rappresentazioni
-di candidati che sono miscele.
-Tuttavia, questa
-selezione delle rappresentazioni è
-solo un passaggio verso
-il learning of representation,
-che significa imparare
-la rappresentazione online da scratch.
-Questo è già fatto
-in pratica con il learning di
-rinforzamento profondo, ma la teoria
-di questo è scomoda.
-Finalmente, possiamo considerare
-il learning di rinforzamento multitasca,
-dove una singola rappresentazione
-potrebbe essere buona per un
-composto di MDPs che condividono
-una struttura. Grazie.

demo_data/nips-2021/25964/transcript_whisper_large-v2.vtt DELETED

@@ -1,1100 +0,0 @@
-WEBVTT
-00:00.000 --> 00:04.000
-e la possibilità di eseguire un'operazione di modello di un'algebra.
-00:04.000 --> 00:07.000
-Questo è un'operazione che è stata creata per il nostro studio,
-00:07.000 --> 00:09.000
-e che è stato creato per il nostro studio.
-00:09.000 --> 00:11.000
-Ciao a tutti, sono Matteo Papini,
-00:11.000 --> 00:13.000
-e questo è un lavoro insieme con Andrea Tirinzoni,
-00:13.000 --> 00:15.000
-Aldo Pacchiano, Marcello Restelli,
-00:15.000 --> 00:18.000
-Alessandro Lazzarici e Matteo Pirotta.
-00:18.000 --> 00:21.000
-Il nostro lavoro è motivato dall'efficacia
-00:21.000 --> 00:23.000
-di algoritmi di imparazione di rinforzamento profondo
-00:23.000 --> 00:26.000
-per risolvere tasche complesse, come i videoghi.
-00:26.000 --> 00:28.000
-Una caratteristica fondamentale di questi metodi
-00:28.000 --> 00:30.000
-è la possibilità di eseguire neural networks
-00:30.000 --> 00:33.000
-per eseguire rappresentazioni complesse delle tasche
-00:33.000 --> 00:36.000
-che permette di rappresentare e imparare
-00:36.000 --> 00:39.000
-le polizie ottime efficacemente.
-00:39.000 --> 00:42.000
-Capire cosa fa una rappresentazione buona
-00:42.000 --> 00:44.000
-e come trovarne una
-00:44.000 --> 00:46.000
-è fondamentale per disegnare
-00:46.000 --> 00:48.000
-migliori algoritmi di imparazione di rinforzamento.
-00:48.000 --> 00:50.000
-In questo lavoro, per prima volta,
-00:50.000 --> 00:52.000
-ci sono state presentate caratterizzazioni formali
-00:52.000 --> 00:55.000
-di rappresentazioni buone per l'imparazione di rinforzamento.
-00:55.000 --> 00:58.000
-Abbiamo mostrato che usare una rappresentazione buona
-00:58.000 --> 01:01.000
-può davvero beneficiare l'efficienza di imparazione
-01:01.000 --> 01:03.000
-e fornire garantie di regretto costante.
-01:03.000 --> 01:06.000
-Finalmente, abbiamo mostrato come una rappresentazione buona
-01:06.000 --> 01:09.000
-può essere selezionata dall'interazione online,
-01:09.000 --> 01:13.000
-un primo passaggio verso l'apprendimento di rappresentazione per RL.
-01:13.000 --> 01:16.000
-Ma prima di tutto, qualche background.
-01:16.000 --> 01:18.000
-Il problema di imparazione è modellato
-01:18.000 --> 01:22.000
-come un processo di decisione di marco finito di orizzonte, o MDP.
-01:22.000 --> 01:26.000
-In ogni passaggio di tempo, l'agente osserva un stato dell'ambiente,
-01:26.000 --> 01:28.000
-prende un'azione e riceve una rinforza
-01:28.000 --> 01:31.000
-e un stato successivo come risultato.
-01:31.000 --> 01:33.000
-Questi sono determinati rispettivamente
-01:33.000 --> 01:36.000
-da una funzione di rinforza e una funzione di transizione
-01:36.000 --> 01:39.000
-che sono un'unità di tempo e un'unità di non-conoscenza.
-01:39.000 --> 01:42.000
-L'interazione è dividita in due episodi
-01:42.000 --> 01:46.000
-di lunghezza finita, che si chiama l'orizzonte.
-01:46.000 --> 01:49.000
-All'ultimo episodio, il stato è risalto
-01:49.000 --> 01:52.000
-a seconda della distribuzione fissata.
-01:52.000 --> 01:55.000
-Il comportamento dell'agente è modellato da una polizia,
-01:55.000 --> 01:58.000
-che è una mappatura da stati all'azione
-01:58.000 --> 02:01.000
-che può anche essere dipendente del tempo.
-02:01.000 --> 02:04.000
-La funzione di valore, o funzione Q della polizia Pi,
-02:04.000 --> 02:07.000
-dà la rinforza aspettata totale
-02:07.000 --> 02:11.000
-ottenuta prendendo l'azione A in stato S a tempo H
-02:11.000 --> 02:15.000
-e poi seguendo la polizia fino all'ultimo episodio.
-02:15.000 --> 02:18.000
-Un'ottima polizia è garantita
-02:18.000 --> 02:22.000
-che la funzione Q si massima su tutti i stati.
-02:22.000 --> 02:25.000
-Facciamo un'assumzione extra
-02:25.000 --> 02:28.000
-che ogni stato admette un'azione ottima unica.
-02:28.000 --> 02:31.000
-Quando il numero di stati è molto grande o anche infinito,
-02:31.000 --> 02:35.000
-imparare l'ottima polizia può essere molto difficile.
-02:35.000 --> 02:38.000
-Quindi guardiamo i linear MDPs
-02:38.000 --> 02:42.000
-dove l'agente ha accesso a una rappresentazione compatta.
-02:42.000 --> 02:44.000
-Questa è una mappatura di caratteristiche
-02:44.000 --> 02:47.000
-da stati e azioni a vectori d-dimensional
-02:47.000 --> 02:50.000
-dove D è più piccolo.
-02:50.000 --> 02:52.000
-Potete vedere queste caratteristiche
-02:52.000 --> 02:55.000
-come l'ultimo strato scoperto di un'intera rete neurale.
-02:55.000 --> 02:57.000
-Nell'apprendimento di rinforzamento profondo
-02:57.000 --> 03:01.000
-impariamo tutti i pesi della rete simultaneamente.
-03:01.000 --> 03:04.000
-Qui mantendremo la rappresentazione fissa
-03:04.000 --> 03:07.000
-e impareremo solo i finali parametri
-03:07.000 --> 03:10.000
-che sono i pesi di una combinazione lineare.
-03:10.000 --> 03:13.000
-Questa funzione lineare, almeno,
-03:13.000 --> 03:16.000
-deve essere in grado di rappresentare la funzione Q ottima
-03:16.000 --> 03:20.000
-in modo da poterla usare per prendere azioni ottime.
-03:20.000 --> 03:22.000
-However,
-03:22.000 --> 03:24.000
-being able to represent the optimal Q function
-03:24.000 --> 03:27.000
-is not enough for efficient learning,
-03:27.000 --> 03:29.000
-because an exponential number of samples
-03:29.000 --> 03:31.000
-may still be required.
-03:31.000 --> 03:33.000
-To avoid this,
-03:33.000 --> 03:35.000
-extra structural assumptions are needed
-03:35.000 --> 03:37.000
-on the MDP,
-03:37.000 --> 03:40.000
-and several have been proposed in the literature.
-03:40.000 --> 03:42.000
-In low-rank MDPs,
-03:42.000 --> 03:45.000
-both the reward function and the transition function
-03:45.000 --> 03:48.000
-are linear in the same features.
-03:48.000 --> 03:51.000
-These features can be time-dependent.
-03:51.000 --> 03:53.000
-We assume, only for simplicity,
-03:53.000 --> 03:56.000
-that the features share the same dimension d.
-03:56.000 --> 03:59.000
-A first consequence of the low-rank structure
-03:59.000 --> 04:02.000
-is that the Q function of every policy
-04:02.000 --> 04:06.000
-can be represented as a linear function of the features.
-04:06.000 --> 04:09.000
-A weaker structural assumption is Bellman closure.
-04:09.000 --> 04:11.000
-In these MDPs,
-04:11.000 --> 04:13.000
-the set of linear functions of the features
-04:13.000 --> 04:16.000
-must be closed under the Bellman optimality operator.
-04:16.000 --> 04:19.000
-The low-rank structure implies Bellman closure,
-04:19.000 --> 04:21.000
-but the converse is not true.
-04:21.000 --> 04:24.000
-Indeed, in Bellman-closure MDPs,
-04:24.000 --> 04:26.000
-only the optimal Q function
-04:26.000 --> 04:29.000
-is guaranteed to be linearly realizable.
-04:29.000 --> 04:32.000
-Efficient reinforcement learning algorithms
-04:32.000 --> 04:34.000
-have been proposed for these settings.
-04:34.000 --> 04:36.000
-We can evaluate them
-04:36.000 --> 04:38.000
-using the notion of regret,
-04:38.000 --> 04:41.000
-which is the total amount of sub-optimality
-04:41.000 --> 04:43.000
-suffered by the agent
-04:43.000 --> 04:45.000
-during the learning process
-04:45.000 --> 04:47.000
-with respect to the optimal policy.
-04:47.000 --> 04:49.000
-In low-rank MDPs,
-04:49.000 --> 04:52.000
-the LSVI-UCB algorithm
-04:52.000 --> 04:54.000
-suffers only sublinear regret
-04:54.000 --> 04:56.000
-in the worst case.
-04:56.000 --> 04:58.000
-Eleanor is a refined version
-04:58.000 --> 05:00.000
-that works in the more general case
-05:00.000 --> 05:02.000
-of Bellman closure
-05:02.000 --> 05:04.000
-and has a better dependence
-05:04.000 --> 05:06.000
-on the feature dimension.
-05:06.000 --> 05:08.000
-It should be noted, however,
-05:08.000 --> 05:10.000
-that Eleanor is computationally intractable.
-05:10.000 --> 05:12.000
-For LSVI-UCB
-05:12.000 --> 05:14.000
-we also have an instance-dependent
-05:14.000 --> 05:16.000
-regret bound
-05:16.000 --> 05:18.000
-that is logarithmic
-05:18.000 --> 05:20.000
-in the total number of interactions.
-05:20.000 --> 05:22.000
-Here Delta denotes
-05:22.000 --> 05:24.000
-the sub-optimality gap
-05:24.000 --> 05:26.000
-of a state-action pair,
-05:26.000 --> 05:28.000
-which is assumed to have
-05:28.000 --> 05:30.000
-a well-defined minimum.
-05:30.000 --> 05:32.000
-All these regret bounds
-05:32.000 --> 05:34.000
-ignore the quality of the representation,
-05:34.000 --> 05:36.000
-apart from the structural assumptions
-05:36.000 --> 05:38.000
-that are necessary
-05:38.000 --> 05:40.000
-for tractability.
-05:40.000 --> 05:42.000
-The question we will try to answer is this.
-05:42.000 --> 05:44.000
-Can we achieve
-05:44.000 --> 05:46.000
-even smaller regret
-05:46.000 --> 05:48.000
-with a good representation?
-05:48.000 --> 05:50.000
-To make this notion
-05:50.000 --> 05:52.000
-of a good representation formal,
-05:52.000 --> 05:54.000
-we introduce the Unisoft property.
-05:54.000 --> 05:56.000
-A representation is Unisoft
-05:56.000 --> 05:58.000
-if the optimal features
-05:58.000 --> 06:00.000
-span the whole feature space.
-06:00.000 --> 06:02.000
-The optimal features are
-06:02.000 --> 06:04.000
-the features of optimal actions
-06:04.000 --> 06:06.000
-in states that are reachable
-06:06.000 --> 06:08.000
-under the optimal policy.
-06:08.000 --> 06:10.000
-Intuitively, the Unisoft property
-06:10.000 --> 06:12.000
-guarantees that the optimal features
-06:12.000 --> 06:14.000
-are diverse enough
-06:14.000 --> 06:16.000
-for the agent
-06:16.000 --> 06:18.000
-to converge rapidly to the optimal policy
-06:18.000 --> 06:20.000
-without reducing
-06:20.000 --> 06:22.000
-the amount of information it receives
-06:22.000 --> 06:24.000
-about the task in general.
-06:24.000 --> 06:26.000
-We can also measure
-06:26.000 --> 06:28.000
-the degree of diversity of the representation
-06:28.000 --> 06:30.000
-by looking at the smallest
-06:30.000 --> 06:32.000
-eigenvalue
-06:32.000 --> 06:34.000
-of the covariance matrix of the optimal features.
-06:34.000 --> 06:36.000
-This parameter Lambda
-06:36.000 --> 06:38.000
-will play an important role
-06:38.000 --> 06:40.000
-in our regret bounds.
-06:40.000 --> 06:42.000
-Note that a higher value of Lambda
-06:42.000 --> 06:44.000
-is better because it denotes
-06:44.000 --> 06:46.000
-more feature diversity,
-06:46.000 --> 06:48.000
-and that Lambda can be at most
-06:48.000 --> 06:50.000
-one under common assumptions
-06:50.000 --> 06:52.000
-on the magnitude of the features.
-06:52.000 --> 06:54.000
-But in what sense are these representations
-06:54.000 --> 06:56.000
-optimal?
-06:56.000 --> 06:58.000
-What we have shown in linear MDPs
-06:58.000 --> 07:00.000
-is that Unisoft is synonymous
-07:00.000 --> 07:02.000
-with constant regret.
-07:02.000 --> 07:04.000
-First, we have shown
-07:04.000 --> 07:06.000
-that the Unisoft property
-07:06.000 --> 07:08.000
-is necessary to achieve
-07:08.000 --> 07:10.000
-constant regret in MDPs
-07:10.000 --> 07:12.000
-with linear rewards.
-07:12.000 --> 07:14.000
-This applies to low-rank MDPs,
-07:14.000 --> 07:16.000
-Bellman closure,
-07:16.000 --> 07:18.000
-and also to linear mixture MDPs,
-07:18.000 --> 07:20.000
-which are another
-07:20.000 --> 07:22.000
-common structural assumption.
-07:22.000 --> 07:24.000
-But Unisoft is also sufficient
-07:24.000 --> 07:26.000
-for constant regret
-07:26.000 --> 07:28.000
-in interesting cases.
-07:28.000 --> 07:30.000
-In low-rank MDPs,
-07:30.000 --> 07:32.000
-LSVI-UCB achieves
-07:32.000 --> 07:34.000
-constant regret if and only if
-07:34.000 --> 07:36.000
-the representation is Unisoft.
-07:36.000 --> 07:38.000
-With high probability,
-07:38.000 --> 07:40.000
-a finite number
-07:40.000 --> 07:42.000
-of interactions is sufficient
-07:42.000 --> 07:44.000
-for the agent to learn
-07:44.000 --> 07:46.000
-the optimal policy perfectly.
-07:46.000 --> 07:48.000
-Hence, the regret can be
-07:48.000 --> 07:50.000
-bounded in terms of this constant time,
-07:50.000 --> 07:52.000
-regardless of the
-07:52.000 --> 07:54.000
-total number of episodes k.
-07:54.000 --> 07:56.000
-In other words, the regret
-07:56.000 --> 07:58.000
-is constant.
-07:58.000 --> 08:00.000
-Notice how the time τ
-08:00.000 --> 08:02.000
-depends inversely
-08:02.000 --> 08:04.000
-on the parameter λ.
-08:04.000 --> 08:06.000
-Indeed, with a more
-08:06.000 --> 08:08.000
-diverse feature map, we can learn
-08:08.000 --> 08:10.000
-the optimal policy faster.
-08:10.000 --> 08:12.000
-We have a similar result
-08:12.000 --> 08:14.000
-for Eleanor in the more general case
-08:14.000 --> 08:16.000
-of Bellman-closure MDPs,
-08:16.000 --> 08:18.000
-also with a better
-08:18.000 --> 08:20.000
-dependence on the feature
-08:20.000 --> 08:22.000
-dimension d.
-08:22.000 --> 08:24.000
-Finally, the lack of
-08:24.000 --> 08:26.000
-[unclear] for Eleanor
-08:26.000 --> 08:28.000
-gives this polynomial
-08:28.000 --> 08:30.000
-dependence on the parameter λ,
-08:30.000 --> 08:32.000
-as opposed to a logarithmic dependence
-08:32.000 --> 08:34.000
-in the case of LSVI-UCB.
-08:34.000 --> 08:36.000
-But this may well be
-08:36.000 --> 08:38.000
-an artifact of our proof.
-08:38.000 --> 08:40.000
-To recap, we have shown
-08:40.000 --> 08:42.000
-that Unisoft is
-08:42.000 --> 08:44.000
-both necessary and sufficient
-08:44.000 --> 08:46.000
-to achieve constant regret
-08:46.000 --> 08:48.000
-in Bellman-closure
-08:48.000 --> 08:50.000
-and low-rank MDPs, and we have
-08:50.000 --> 08:52.000
-proved constant regret
-08:52.000 --> 08:54.000
-upper bounds for common algorithms.
-08:54.000 --> 08:56.000
-In the last part of the
-08:56.000 --> 08:58.000
-talk, we show how
-08:58.000 --> 09:00.000
-good representations can be
-09:00.000 --> 09:02.000
-selected online.
-09:02.000 --> 09:04.000
-We focus on low-rank MDPs
-09:04.000 --> 09:06.000
-for simplicity.
-09:06.000 --> 09:08.000
-The agent is given a set
-09:08.000 --> 09:10.000
-of N candidate representations
-09:10.000 --> 09:12.000
-that represent
-09:12.000 --> 09:14.000
-the same low-rank MDP
-09:14.000 --> 09:16.000
-without misspecification.
-09:16.000 --> 09:18.000
-The representations can have
-09:18.000 --> 09:20.000
-different dimensions.
-09:20.000 --> 09:22.000
-This differs from the typical approach
-09:22.000 --> 09:24.000
-to representation learning in RL,
-09:24.000 --> 09:26.000
-where one tries to find
-09:26.000 --> 09:28.000
-an accurate representation
-09:28.000 --> 09:30.000
-from a realizable function class.
-09:30.000 --> 09:32.000
-This makes it possible to
-09:32.000 --> 09:34.000
-handle misspecification, but
-09:34.000 --> 09:36.000
-it is typically done offline.
-09:36.000 --> 09:38.000
-Our goal is
-09:38.000 --> 09:40.000
-to learn as efficiently
-09:40.000 --> 09:42.000
-as if we were using the best
-09:42.000 --> 09:44.000
-candidate representation in the set,
-09:44.000 --> 09:46.000
-without knowing it in advance.
-09:46.000 --> 09:48.000
-Of course, if one of the candidates
-09:48.000 --> 09:50.000
-is Unisoft, we would like
-09:50.000 --> 09:52.000
-to achieve constant regret.
-09:52.000 --> 09:54.000
-The algorithm we propose
-09:54.000 --> 09:56.000
-is LSVI-Leader.
-09:56.000 --> 09:58.000
-It runs
-09:58.000 --> 10:00.000
-N parallel instances of LSVI-UCB,
-10:00.000 --> 10:02.000
-one for each candidate
-10:02.000 --> 10:04.000
-representation.
-10:04.000 --> 10:06.000
-For each representation, we use
-10:06.000 --> 10:08.000
-all the data collected
-10:08.000 --> 10:10.000
-by the agent to estimate
-10:10.000 --> 10:12.000
-the parameter of the optimal
-10:12.000 --> 10:14.000
-Q function according
-10:14.000 --> 10:16.000
-to this representation.
-10:16.000 --> 10:18.000
-This is done with a combination
-10:18.000 --> 10:20.000
-of least squares and backward induction.
-10:20.000 --> 10:22.000
-An exploration bonus
-10:22.000 --> 10:24.000
-is added to the estimate
-10:24.000 --> 10:26.000
-of the parameter to make
-10:26.000 --> 10:28.000
-the estimate optimistic, as in LSVI-UCB.
-10:28.000 --> 10:30.000
-But now
-10:30.000 --> 10:32.000
-we have an optimistic parameter
-10:32.000 --> 10:34.000
-for each representation,
-10:34.000 --> 10:36.000
-and the action is chosen
-10:36.000 --> 10:38.000
-to maximize the smallest
-10:38.000 --> 10:40.000
-optimistic parameter,
-10:40.000 --> 10:42.000
-which is also the tightest estimate.
-10:42.000 --> 10:44.000
-Notice how this
-10:44.000 --> 10:46.000
-is actually more powerful
-10:46.000 --> 10:48.000
-than a model selection algorithm,
-10:48.000 --> 10:50.000
-because we can use
-10:50.000 --> 10:52.000
-a different representation
-10:52.000 --> 10:54.000
-for each state.
-10:54.000 --> 10:56.000
-We show that the regret of LSVI-Leader
-10:56.000 --> 10:58.000
-is upper bounded
-10:58.000 --> 11:00.000
-by that of LSVI-UCB
-11:00.000 --> 11:02.000
-when run with the best
-11:02.000 --> 11:04.000
-representation among the candidates,
-11:04.000 --> 11:06.000
-up to a factor
-11:06.000 --> 11:08.000
-that is the square root
-11:08.000 --> 11:10.000
-of the number of candidates.
-11:10.000 --> 11:12.000
-This means that if we have
-11:12.000 --> 11:14.000
-a Unisoft representation in the set,
-11:14.000 --> 11:16.000
-LSVI-Leader
-11:16.000 --> 11:18.000
-achieves constant regret.
-11:18.000 --> 11:20.000
-But LSVI-Leader
-11:20.000 --> 11:22.000
-can combine representations
-11:22.000 --> 11:24.000
-across stages, states and actions,
-11:24.000 --> 11:26.000
-and so
-11:26.000 --> 11:28.000
-it can sometimes achieve
-11:28.000 --> 11:30.000
-constant regret
-11:30.000 --> 11:32.000
-even if there is no Unisoft
-11:32.000 --> 11:34.000
-candidate representation.
-11:34.000 --> 11:36.000
-Our theoretical results are also supported
-11:36.000 --> 11:38.000
-by empirical results
-11:38.000 --> 11:40.000
-in small MDPs.
-11:40.000 --> 11:42.000
-These plots show the regret
-11:42.000 --> 11:44.000
-as a function of the number of episodes.
-11:44.000 --> 11:46.000
-On the left we have
-11:46.000 --> 11:48.000
-the regret of LSVI-UCB
-11:48.000 --> 11:50.000
-run with
-11:50.000 --> 11:52.000
-different representations.
-11:52.000 --> 11:54.000
-Of these, only the representation
-11:54.000 --> 11:56.000
-shown in gray in the plot
-11:56.000 --> 11:58.000
-is Unisoft, and only in this case
-11:58.000 --> 12:00.000
-is LSVI-UCB able
-12:00.000 --> 12:02.000
-to achieve constant regret.
-12:02.000 --> 12:04.000
-On the right we have the regret
-12:04.000 --> 12:06.000
-of LSVI-Leader
-12:06.000 --> 12:08.000
-run with various sets of candidates.
-12:08.000 --> 12:10.000
-In all these cases,
-12:10.000 --> 12:12.000
-LSVI-Leader achieves
-12:12.000 --> 12:14.000
-constant regret.
-12:14.000 --> 12:16.000
-Of course, without knowing
-12:16.000 --> 12:18.000
-the best representation in advance,
-12:18.000 --> 12:20.000
-it takes more time to learn the optimal policy,
-12:20.000 --> 12:22.000
-but this was also expected
-12:22.000 --> 12:24.000
-from our regret bound.
-12:24.000 --> 12:26.000
-The orange curve is particularly
-12:26.000 --> 12:28.000
-interesting, because in this case
-12:28.000 --> 12:30.000
-the only Unisoft representation,
-12:30.000 --> 12:32.000
-number 1,
-12:32.000 --> 12:34.000
-is not in the candidate set,
-12:34.000 --> 12:36.000
-but LSVI-Leader is still able
-12:36.000 --> 12:38.000
-to achieve constant regret
-12:38.000 --> 12:40.000
-by combining the remaining representations.
-12:40.000 --> 12:42.000
-In future work,
-12:42.000 --> 12:44.000
-we would like to improve this
-12:44.000 --> 12:46.000
-square root of N factor in the regret of LSVI-Leader,
-12:46.000 --> 12:48.000
-because in the case of linear bandits
-12:48.000 --> 12:50.000
-the dependence on the number
-12:50.000 --> 12:52.000
-of representations is only logarithmic.
-12:52.000 --> 12:54.000
-We would also like
-12:54.000 --> 12:56.000
-to extend LSVI-Leader
-12:56.000 --> 12:58.000
-to handle candidate
-12:58.000 --> 13:00.000
-representations that are misspecified.
-13:00.000 --> 13:02.000
-However, this
-13:02.000 --> 13:04.000
-representation selection is
-13:04.000 --> 13:06.000
-only a step towards
-13:06.000 --> 13:08.000
-representation learning,
-13:08.000 --> 13:10.000
-which means learning
-13:10.000 --> 13:12.000
-the representation online from scratch.
-13:12.000 --> 13:14.000
-This is already done
-13:14.000 --> 13:16.000
-in practice with deep
-13:16.000 --> 13:18.000
-reinforcement learning, but the theory
-13:18.000 --> 13:20.000
-of this is lacking.
-13:20.000 --> 13:22.000
-Finally, we can consider
-13:22.000 --> 13:24.000
-multitask reinforcement learning,
-13:24.000 --> 13:26.000
-where a single representation
-13:26.000 --> 13:28.000
-might be good for a
-13:28.000 --> 13:30.000
-set of MDPs that share
-13:30.000 --> 13:36.000
-some structure. Thank you.

demo_data/nips-2021/25964/video.mp4 DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:21aef3b31235ac9e8a4e96500589de83c27b58f96e98f6a6c50b46d1fedd106e
-size 87305378