• 0 Posts
  • 5 Comments
Joined 2 months ago
Cake day: November 30th, 2024

  • pcalau12i@lemmygrad.ml to Open Source@lemmy.ml: Proton's biased article on Deepseek
    1 hour ago

    There is no “fundamentally” here, you are referring to some abstraction that doesn’t exist. The models are modified during the fine-tuning process, and the process trains them to learn to adopt DeepSeek R1’s reasoning technique. You are acting like there is some “essence” underlying the model which is the same between the original Qwen and this model. There isn’t. It is a hybrid and its own thing. There is no such thing as “base capability,” the model is not two separate pieces that can be judged independently. You can only evaluate the model as a whole. Your comment is just incredibly bizarre to respond to because you are referring to non-existent abstractions and not actually speaking of anything concretely real.

    The model is neither Qwen nor DeepSeek R1, it is DeepSeek R1 Qwen Distill as the name says. It would be like saying it’s false advertising to say a mule is a hybrid of a donkey and a horse because the “base capabilities” are those of a donkey and so it has nothing to do with horses, and it’s really just a donkey at the end of the day. The statement is so bizarre I just do not even know how to address it. It is a hybrid, it’s its own distinct third thing that is a hybrid of them both. The model’s capabilities can only be judged as it exists, and its capabilities differ from Qwen and the original DeepSeek R1 as actually scored by various metrics.

    Do you not know what fine-tuning is? It refers to actually adjusting the weights in the model, and it is the weights that define the model. And this fine-tuning is being done alongside DeepSeek R1, meaning it is being adjusted to take on capabilities of R1 within the model. It gains R1 capabilities at the expense of Qwen capabilities as DeepSeek R1 Qwen Distill performs better on reasoning tasks but actually not as well as baseline models on non-reasoning tasks. The weights literally have information both of Qwen and R1 within them at the same time.

    Speaking of its “base capabilities” is a meaningless floating abstraction which cannot be empirically measured and doesn’t refer to anything concretely real. It only has its real concrete capabilities, not some hypothetical imagined capabilities. You accuse them of “marketing” even though it is literally free. All DeepSeek sells is compute to run models, but you can pay any company to run these distill models. They have no financial benefit for misleading people about the distill models.

    You genuinely are not making any coherent sense at all, you are insisting a hybrid model which is objectively different and objectively scores and performs differently should be given the exact same name, for reasons you cannot seem to actually articulate. It clearly needs a different name, and since it was created utilizing the DeepSeek R1 model’s distillation process to fine-tune it, it seems to make sense to call it DeepSeek R1 Qwen Distill. Yet for some reason you insist this is lying and misrepresenting it and it actually has literally nothing to do with DeepSeek R1 at all and it should just be called Qwen and we should pretend it is literally the same model despite it not being the same model as its training weights are different (you can do a “diff” on the two model files if you don’t believe me!) and it performs differently on the same metrics.

    There is simply no rational reason to intentionally want to mislabel the model as just being Qwen and having no relevance to DeepSeek R1. You yourself admitted that the weights are trained on R1 data so they necessarily contain some R1 capabilities. If DeepSeek was lying and trying to hide that the distill models are based on Qwen and Llama, they wouldn’t have literally put that in the name to let everyone know, and released a paper explaining exactly how those were produced.

    It is clear to me that you and your other friends here have some sort of alternative agenda that makes you not want to label it correctly. DeepSeek is open about the distill models using Qwen and Llama, but you want them to be closed and not reveal that they also used DeepSeek R1. The current name for it is perfectly fine and pretending it is just a Qwen model (or Llama, for the other distilled version) is straight-up misinformation, and anyone who downloads the models and runs them themselves will clearly see immediately that they perform differently. They are hybrid models correctly called what they are: DeepSeek R1 Qwen Distill and DeepSeek R1 Llama Distill.


  • pcalau12i@lemmygrad.ml to Open Source@lemmy.ml: Proton's biased article on Deepseek
    2 hours ago

    The 1.5B/7B/8B/14B/32B/70B models are all officially DeepSeek R1 models; that is what DeepSeek themselves refer to those models as. It is DeepSeek themselves who produced those models, released them to the public, and gave them their names. And their names are correct; it is just factually false to say they are not DeepSeek R1 models. They are.

    The “R1” in the name means “reasoning version one” because it does not just spit out an answer but reasons through it with an internal monologue. For example, here is a simple query I asked DeepSeek R1 14B:

    Me: can all the planets in the solar system fit between the earth and the moon?

    DeepSeek: Yes, all eight planets could theoretically be lined up along the line connecting Earth and the Moon without overlapping. The combined length of their diameters (approximately 379,011 km) is slightly less than the average Earth-Moon distance (about 384,400 km), allowing them to fit if placed consecutively with no required spacing.

    However, on top of its answer, I can expand an option to see the internal monologue it went through before generating the answer, which you can find here because it’s too long to paste.

    What makes these consumer-oriented models different is that rather than being trained on raw data, they are trained on synthetic data from pre-existing models. That’s what the “Qwen” or “Llama” parts mean in the name. The 7B model is trained on synthetic data produced by Qwen, so it is effectively a compressed version of Qwen. However, neither Qwen nor Llama can “reason,” they do not have an internal monologue.

    This is why it is just incorrect to claim that something like DeepSeek R1 7B Qwen Distill has no relevance to DeepSeek R1 but is just a Qwen model. If it’s supposedly a Qwen model, why is it that it can do something that Qwen cannot do but only DeepSeek R1 can? It’s because, again, it is a DeepSeek R1 model: they add the R1 reasoning to it during the distillation process as part of its training. They basically use synthetic data generated from DeepSeek R1 to fine-tune (readjust) its parameters so it adopts a similar reasoning style. It is objectively a new model because it performs better on reasoning tasks than just a normal Qwen model. It cannot be considered solely a Qwen model nor an R1 model because its parameters contain information from both.


  • quantum nature of the randomly generated numbers helped specifically with quantum computer simulations, but based on your reply you clearly just meant that you were using it as a multi-purpose RNG that is free of unwanted correlations between the randomly generated bits

    It is used as the source of entropy for the simulator. Quantum mechanics is random, so to actually get the results you have to sample it. In quantum computing, this typically involves running the same program tens of thousands of times, which are called “shots,” and then forming a distribution of the results. The sampling with the simulator uses the QRNG for the source of entropy, so the sampling results are truly random.
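    The idea of sampling "shots" from a pluggable entropy source can be sketched roughly as follows. This is only an illustration, not the simulator's actual code: the names `os_entropy_float` and `sample_shots` are made up for the example, and `os.urandom` is just a stand-in for whatever hardware source (QRNG card or otherwise) would really supply the randomness.

```python
import os
from collections import Counter

def os_entropy_float():
    """Return a float in [0, 1) built from 53 bits of OS-provided entropy.

    os.urandom is only a stand-in here (an assumption for this sketch);
    a real setup would read from the QRNG card or another hardware source.
    """
    n = int.from_bytes(os.urandom(8), "big") >> 11  # keep the top 53 bits
    return n / (1 << 53)

def sample_shots(amplitudes, shots, rand_float=os_entropy_float):
    """Sample one measurement outcome per shot from a state vector."""
    # Born rule: the probability of each basis state is |amplitude|^2.
    probs = [abs(a) ** 2 for a in amplitudes]
    n_qubits = (len(amplitudes) - 1).bit_length()
    counts = Counter()
    for _ in range(shots):
        r = rand_float()
        cumulative = 0.0
        outcome = len(probs) - 1  # fallback guards against rounding error
        for index, p in enumerate(probs):
            cumulative += p
            if r < cumulative:
                outcome = index
                break
        counts[format(outcome, f"0{n_qubits}b")] += 1
    return counts

# Bell state (|00> + |11>) / sqrt(2): only "00" and "11" should appear.
bell = [2 ** -0.5, 0, 0, 2 ** -0.5]
print(sample_shots(bell, 10_000))
```

    Passing the entropy source in as a function is what makes it possible to swap a hardware generator for a seeded pseudorandom one without touching the sampling loop.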

    Out of curiosity, have you found that the card works as well as advertised? I ask because it seems to me that any imprecision in the design and/or manufacture of the card could introduce systematic errors in the quantum measurements that would result in correlations in the sampled bits, so I am curious if you have been able to verify that is not something to be concerned about.

    I have tried several hardware random number generators and usually there is no bias, either because they specifically designed it not to have a bias or because they have some level of post-processing to remove the bias. If there is a bias, it is possible to remove the bias yourself. There are two methods that I tend to use, depending upon the source of the bias.

    To be “random” simply means each bit is statistically independent of each other bit, not necessarily that the outcome is uniform, i.e. 50% chance of 0 and 50% chance of 1. It can still be considered truly random with a non-uniform distribution, such as 52% chance of 0 and 48% chance of 1, as long as each successive bit is entirely independent of any previous bit, i.e. there is no statistical analysis you could ever perform on the bits to improve your chances of predicting the next one beyond the initial distribution of 52%/48%.

    In the case where it is genuinely random (statistical independence) yet is non-uniform (which we can call nondeterministic bias), you can transform it into a uniform distribution using what is known as a von Neumann extractor. This takes advantage of a simple probability rule for statistically independent data whereby Pr(A)Pr(B)=Pr(B)Pr(A). Let’s say A=0 and B=1, then Pr(0)Pr(1)=Pr(1)Pr(0), so the pairs 01 and 10 are exactly equally likely. That means you can read two bits at a time rather than one and throw out all results that are 00 and 11 and only keep results that are 01 or 10, and then you can map 01 to 0 and 10 to 1. You would then be mathematically guaranteed that the resulting distribution of bits is perfectly uniform with 50% chance of 0 and 50% chance of 1.
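    A minimal sketch of the extractor described above, assuming the input is a list of 0/1 integers. The biased input is simulated with Python's `random` module purely as a stand-in for a real hardware source, and the function name is made up for the example.

```python
import random

def von_neumann_extract(bits):
    """Debias independent bits: 01 -> 0, 10 -> 1, drop 00 and 11."""
    out = []
    # Walk non-overlapping pairs; a trailing unpaired bit is ignored.
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            out.append(first)  # pair 01 yields 0, pair 10 yields 1
    return out

# Simulated biased source (a stand-in for real hardware): about 95% zeros.
rng = random.Random(1234)
biased = [1 if rng.random() < 0.05 else 0 for _ in range(200_000)]
extracted = von_neumann_extract(biased)

print("input ones fraction: ", sum(biased) / len(biased))
print("output ones fraction:", sum(extracted) / len(extracted))
print("bits kept:", len(extracted), "of", len(biased))
```

    The output fraction of ones should land close to 0.5 even though the input is heavily skewed, at the cost of discarding most of the input.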

    I have used this method to develop my own hardware random number generator that can pull random numbers from the air, by analyzing tiny fluctuations in electrical noise in your environment using an antenna. The problem is that electromagnetic waves are not always hitting the antenna, so there can often be long strings of zeros, so if you set something up like this, you will find your random numbers are massively skewed towards zero (like 95% chance of 0 and 5% chance of 1). However, since each bit still is truly independent of the successive bit, using this method will give you a uniform distribution of 50% 0 and 50% 1.

    One thing to keep in mind, though, is that the bigger the skew, the more data you have to throw out. The hardware random number generator I built that pulls the numbers from the air ends up throwing out the vast majority of the data due to the huge bias, so it can be very slow. There are other algorithms which throw out less data, but they can be much more mathematically complicated and require far more resources.
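    To put rough numbers on that cost: for independent bits with Pr(1) = p, a pair is kept with probability 2p(1-p) and each pair consumes two input bits, so the expected yield is p(1-p) output bits per input bit. The small helper below (a hypothetical name, just for illustration) evaluates that for a few skews.

```python
def von_neumann_yield(p_one):
    """Expected output bits per input bit for independent bits with Pr(1) = p_one."""
    # A pair is kept only if it is 01 or 10: probability 2 * p * (1 - p).
    # Each kept pair gives one output bit but consumed two input bits.
    return p_one * (1 - p_one)

for p in (0.5, 0.48, 0.05):
    y = von_neumann_yield(p)
    print(f"Pr(1) = {p:.2f}: about {y:.4f} output bits per input bit, "
          f"i.e. roughly {1 / y:.1f} input bits per output bit")
```

    Even a perfectly uniform source only yields one output bit per four input bits, while a 95%/5% skew needs about 21 input bits per output bit.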

    In the cases where it may not be genuinely random because the bias is caused by some imperfection in the design (which we can call deterministic bias), you can still uniformly distribute the bias across all the bits so that not only would it be much more difficult to detect the bias, but you would still get uniform results. The way to do this is to take your random number and XOR it with some data set that is non-random but uniform, which you can generate from a pseudorandom number generator like C’s rand() function.
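    The XOR masking step can be sketched like this. Python's seeded `random.Random` stands in for C's rand() (an assumption made to keep the example self-contained), the skewed input is simulated rather than read from hardware, and `xor_whiten` is a made-up name.

```python
import random

def xor_whiten(bits, seed):
    """XOR each input bit with a bit from a seeded pseudorandom generator."""
    prng = random.Random(seed)  # stand-in for C's rand() or an AES-based generator
    return [b ^ prng.getrandbits(1) for b in bits]

# Simulated source with a 52%/48% skew toward 0 (a stand-in for real hardware).
source = random.Random(99)
raw = [1 if source.random() < 0.48 else 0 for _ in range(100_000)]
masked = xor_whiten(raw, seed=2024)

print("raw ones fraction:   ", sum(raw) / len(raw))
print("masked ones fraction:", sum(masked) / len(masked))

# Anyone who knows the seed can undo the mask and recover the skewed stream,
# which is why this only hides the bias rather than removing it.
assert xor_whiten(masked, seed=2024) == raw
```

    The final assertion is the important part: the mask is fully reversible for anyone who can reproduce the pseudorandom stream.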

    This will not improve the quality of the random numbers. Say the source is biased 52% to 48% and you use this method to de-bias it so the distribution is 50% to 50%: if someone can predict the next value of the rand() function, their ability to make a prediction goes back up to 52% to 48%. You can make that more difficult by using a higher-quality pseudorandom number generator, such as one built on AES. NIST even has standards for this kind of post-processing.

    But ultimately using this method is only obfuscation, making it more and more difficult to discover the deterministic bias by hiding it away more cleverly, but it does not truly get rid of it. It’s impossible to take a random data set with some deterministic bias and truly get rid of the deterministic bias purely through deterministic mathematical transformations. You can only hide it away very cleverly. Only if the bias is nondeterministic can you get rid of it with a mathematical transformation.

    It is impossible to reduce the quality of the random numbers this way. If the entropy source is truly random and truly non-biased, then XORing it with the C rand() function, despite it being a low-quality pseudorandom number generator, is mathematically guaranteed to still output something truly random and non-biased. So there is never harm in doing this.

    However, in my experience if you find your hardware random number generator is biased (most aren’t), the bias usually isn’t very large. If something is truly random but biased so that there is a 52% chance of 0 and 48% chance of 1, this isn’t enough of a bias to actually cause many issues. You could even use it for something like cryptography and even if someone does figure out the bias, it would not increase their ability to predict keys enough to actually put anything at risk. If you use a cryptographically secure pseudorandom number generator (CSPRNG) in place of something like C rand(), they will likely not be able to discover the bias in the first place, as these do a very good job at obfuscating the bias to the point that it will likely be undetectable.


  • I’m not sure what you mean by “turning it into a classical random number.” The only point of the card is to make sure that the sampling results from the simulator are truly random, down to a quantum level, and have no deterministic patterns in them. Indeed, actually using quantum optics for this purpose is a bit overkill as there are hardware random number generators which are not quantum-based and produce something good enough for all practical purposes, like Intel Secure Key Technology which is built into most modern x86 CPUs.

    For that reason, my software does allow you to select other hardware random number generators. For example, you can easily get an entire build (including the GPU) that can run simulations of 14 qubits for only a few hundred dollars if you just use the Intel Secure Key Technology option. It also supports a much cheaper device called TrueRNGv3 which is a USB device. It also has an option to use a pseudorandom number generator if you’re not that interested in randomness accuracy, and when using the pseudorandom number generator option it also supports “hidden variables” which really just act as the seed to the pseudorandom number generator.

    For most practical purposes, no, you do not need this card and it’s definitely overkill. The main reason I even bought it was that I was adding support for hardware random number generators to my software and wanted to support a quantum one, so I needed to buy it to actually test it and make sure it works. But now I use it regularly for the back-end to my simulator just because I think it is neat.