<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Enyan Zhang</title>
<link>https://enyanz.com/posts.html</link>
<atom:link href="https://enyanz.com/posts.xml" rel="self" type="application/rss+xml"/>
<description>Enyan&#39;s corner of the web</description>
<generator>quarto-1.6.40</generator>
<lastBuildDate>Thu, 02 Oct 2025 04:00:00 GMT</lastBuildDate>
<item>
  <title>San Francisco, August 2025</title>
  <dc:creator>Enyan Zhang</dc:creator>
  <link>https://enyanz.com/posts/sf-2508/</link>
  <description><![CDATA[ 





<!-- For the title picture -->
<style>
    .quarto-title-banner {
        aspect-ratio: 3/2;
    }
</style>
<p>A mini photo dump from my August trip to San Francisco, which was my first time on the west coast since 2014.</p>
<p>All photos here were taken with a Pentax SP loaded with Fujifilm 200. I liked the idea of the contrast: capturing the next wave of automation and human technology with a fully mechanical camera.</p>
<div>

</div>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="000061680012.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680012.webp" class="img-fluid" alt="A red-and-gold pagoda-style gate peeks through dense evergreens in a Japanese garden, late-afternoon light raking the foliage."></a></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="000061680008.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680008.webp" class="img-fluid" alt="Crowd gathered on a closed park road in bright sun; a child in neon overalls and a red-haired teen watch a street demo."></a></p>
</div>
</div>
</div>
<hr>
<p>I also had my first Waymo experience. I had intentionally avoided watching videos or reading others’ accounts of Waymo since its launch<sup>1</sup>. My most optimistic projection was that the first-time experience would be so smooth and human-level that it would feel underwhelming, as if I were just sitting in a normal human-driven car. That was exactly what the rides felt like.</p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup>&nbsp;Inspired by the story I once heard that Putnam tries to avoid learning about the distinction between Elm and Beech, so he can keep using <a href="https://philosophy-science-humanities-controversies.com/listview-list.php?concept=Elm%2FBeech+Example">this example</a>. Anyways, that’s quite off topic even for a post like this.</p></div></div><p><a href="000061680022.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680022.webp" class="img-fluid" alt="A white Jaguar I-Pace Waymo test vehicle turns onto a residential San Francisco street in warm, low sun."></a></p>
<div>

</div>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="000061680023.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680023.webp" class="img-fluid" alt="View through a windshield down a steep city block; trolleybus wires crisscross above a “Fillmore” street sign."></a></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="000061680024.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680024.webp" class="img-fluid" alt="Close view of a Waymo-equipped white Jaguar curbside; a pedestrian passes behind it on a sunny urban corner."></a></p>
</div>
</div>
</div>
<hr>
<p>The other recurring topic during the trip, especially among people who haven’t been in SF for a while, was how the billboards on the drive into SF are exclusively for AI companies. I liked the diversity, though: they ranged from genuinely good ads, to confusing slogans, to outright antisocial ones.</p>
<p>I also thought I should be recording those on film, a dying medium more than a century old.</p>
<p><a href="000061680026.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680026.webp" class="img-fluid" alt="Blue Okta billboard reading “Build and secure AI agents from day one” rises beside trees against a clear sky."></a></p>
<p><a href="000061680025.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680025.webp" class="img-fluid" alt="Highway scene with a “Prompt it. Then push it.” Figma billboard above terraced houses; cars blur past in the foreground."></a></p>
<p>I must say that the cityscape itself is quite dull and disappointing, especially for the technological center of the world. The people and the atmosphere are quite magical, though.</p>
<hr>
<p><a href="000061680029.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/sf-2508/000061680029.webp" class="img-fluid" alt="Airport gate interior in shadow frames a United jet taxiing on the runway beyond large floor-to-ceiling windows."></a></p>
<p>At SFO.</p>




 ]]></description>
  <category>photography</category>
  <category>life</category>
  <guid>https://enyanz.com/posts/sf-2508/</guid>
  <pubDate>Thu, 02 Oct 2025 04:00:00 GMT</pubDate>
  <media:content url="https://enyanz.com/posts/sf-2508/000061680016.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>The Super-Takumar 1:1.4/50mm</title>
  <dc:creator>Enyan Zhang</dc:creator>
  <link>https://enyanz.com/posts/takumar-50-f1.4/</link>
  <description><![CDATA[ 





<!-- For the title picture -->
<style>
    .quarto-title-banner {
        aspect-ratio: 3/2;
    }
</style>
<p>I was at Fujiya’s film section in Nakano. The camera selection wasn’t very exciting. Someone told me that Kitamura might have what I was looking for. After a few turns and a narrow staircase up, I found myself looking at an entire shelf of old cameras and lenses marked as junk.</p>
<p>A whole row of 50 and 55mm Takumars was lying there. Most were in pretty bad condition. One stood out: the aperture blades had oil stains and there was a lot of yellowing<sup>1</sup>, but apart from that it was so clean I could barely see any dust. It was just 4400 yen. How could I say no?</p>
<div class="no-row-height column-margin column-container"><div id="fn1"><p><sup>1</sup>&nbsp;Happens a lot to the radioactive Takumars, and supposedly <a href="https://pentax-manuals.com/repairs/yellow.htm">reversible</a>.</p></div></div><p>With a X-T30 II in hand, I asked if they had any M42 to Fujifilm X adapters. You might try your luck at Fujiya’s junk section, the clerk assisting me said. I ran back to Fujiya. Amazingly, it was the only adapter they had in the shop. I ran to Kitamura again, put the adapter on, and tried the lens. It worked amazingly well: yellowing is nothing but a white balance issue for digital cameras. I boarded the train east with the fastest lens I owned thus far: the Super-Takumar 50mm f1.4.</p>
<p><br></p>
<section id="the-lens" class="level2">
<h2 class="anchored" data-anchor-id="the-lens">The Lens</h2>
<p>The lens is certainly one of the more famous Takumars, and you’ll find a lot of reviews online. It’s amazingly built, yet not too heavy at 230g. With an adapter added (mine is 130g) it becomes heavy for an APS-C lens, but the balance with my X-T30 still doesn’t feel off. With focus assist on a modern mirrorless body, manual focusing was surprisingly easy: by the second day I was ambitious enough to start tracking moving objects.</p>
<p>For most models, you can see a number on the back of the manual/auto switch, which controls the automatic stop-down pin for the aperture. Mine says 38701, and is a 7-element version of the Super-Takumar. It is said to be a simplification of the earlier, much rarer 8-element version. The later multi-coated version supposedly has better flare control, as well as an additional protrusion at the mount that communicates with the camera (like the SP F) for open-aperture metering. You can find more info <a href="https://www.pentaxforums.com/lensreviews/SMC-S-M-C-Super-Takumar-50mm-F1.4.html">here</a>.</p>
</section>
<section id="shooting" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="shooting">Shooting</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="DSCF2760-Enhanced-NR.webp" class="lightbox" data-gallery="pictures" title="Look at how much it blooms! It makes for very unique shots in night scenes"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF2760-Enhanced-NR.webp" class="img-fluid figure-img" alt="Aerial night view of Tokyo’s Shibuya Crossing from above, showing mostly cleared crosswalks with a few groups of people waiting. The scene is dominated by vivid illuminated signage on tall buildings, colorful advertisements, and the glow of city lights reflecting off the streets."></a></p>
<figcaption>Look at how much it blooms! It makes for very unique shots in night scenes</figcaption>
</figure>
</div>
<p>I enjoyed the bokeh and the amount of light it lets in at F1.4. But wide open it also softens a lot: your pictures will be fine if there’s not much contrast. With any light source or bright reflections, however, there’ll be a big glowing, dreamy bloom around any bright area.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="DSCF3064.webp" class="lightbox" data-gallery="pictures" title="Shinjuku at night. This one is shot at F/2"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3064.webp" class="img-fluid figure-img" alt="A wide nighttime cityscape view of Shinjuku, Tokyo, featuring a dense cluster of brightly lit billboards and tall buildings. A yellow-green train crosses an elevated track in the lower part of the image, while cars and crowds are visible on the streets below, creating a lively, layered urban scene."></a></p>
<figcaption>Shinjuku at night. This one is shot at F/2</figcaption>
</figure>
</div>
<p>Stopping down to F2 makes that blooming a lot more controlled. You also start seeing hexagon-shaped bokehs and six-pointed sunstars — the lens has 6 (non-rounded) aperture blades. I liked it a lot!</p>
<p>All of the following pictures are shot at F2.</p>
<p class="page-columns page-full"><a href="DSCF3252.webp" class="lightbox page-columns page-full" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3252.webp" class="column-screen-inset img-fluid" alt="A crowded platform at Shinjuku train station at night, with passengers lined up along the edge waiting for an arriving E-235 Yamanote Line train. Overhead signs indicate track numbers, and fluorescent lighting reflects off the wet platform floor."></a></p>
<div class="column-screen-inset" style="display: flex; justify-content: space-between; align-items: flex-start; gap: 1rem;">
<p><a href="DSCF3281.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3281.webp" class="img-fluid" style="flex: 1; height: 26.5vw; object-fit: contain;" alt="A brightly lit yellow and green automated parking payment machine in Japan, captured at night under a light rain. The machine stands on a wet, reflective asphalt surface near a quiet urban street, with Japanese instructions and payment slots clearly visible."></a></p>
<p><a href="DSCF3266.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3266.webp" class="img-fluid" style="flex: 1; height: 26.5vw; object-fit: contain;" alt="A detailed nighttime close-up of a white and green automatic platform safety gate at Takadanobaba train station in Tokyo. The gate label indicates 'Yamanote Line, Car 2, Door 1;' in Japanese, and beyond it, colorful city lights and signage appear out of focus, producing a vibrant hexagonal bokeh effect that contrasts with the sharp foreground."></a></p>
<p><a href="DSCF2927.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF2927.webp" class="img-fluid" style="flex: 1; height: 26.5vw; object-fit: contain;" alt="A close-up of a handmade ceramic pitcher with a glossy, multicolored abstract glaze in red, yellow, blue, and white on a black base. The pitcher is displayed on a table at a pottery market, with additional ceramic pieces and softly blurred visitors in the background."></a></p>
</div>
<p>At even smaller apertures, the lens seems to be perfectly capable of resolving 24MP<sup>2</sup>: nail the focus, and you’ll get sharp images down to the last pixel. Having shot quite a bit with the XC15-45mm kit lens, this was quite shocking. It’s also a good street photography lens: at F11 and F16 the depth of field is so large that the <a href="https://en.wikipedia.org/wiki/Hyperfocal_distance">hyperfocal distance</a><sup>3</sup> is just a few meters, so it barely needs any focusing for most scenes. If you choose to open up, though, you still get a lot of light for the night and beautiful bokehs.</p>
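<p>(For a rough sense of the numbers: the standard approximation is H ≈ f²/(N·c) + f. With f = 50mm, N = 16, and an assumed APS-C circle of confusion of c ≈ 0.02mm, H ≈ 2500/(16 × 0.02) + 50mm ≈ 7.9m; focus there, and everything from roughly 4m to infinity is acceptably sharp.)</p>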
<div class="no-row-height column-margin column-container"><div id="fn2"><p><sup>2</sup>&nbsp;Well, my Fujifilm has a 24MP sensor, so I can’t tell you much beyond that</p></div><div id="fn3"><p><sup>3</sup>&nbsp;Something I probably will never learn if I didn’t start playing with manual lenses</p></div></div><p><a href="DSCF2584.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF2584.webp" class="img-fluid" alt="A middle-aged woman sits alone on a blue bench at an empty Japanese train station platform, holding a black bag on her lap. Overhead, a digital sign reads 'Train Approach Information' in English and Japanese, while platform safety doors line the edge, and a quiet urban background with tracks extends into the distance."></a></p>
<p><a href="DSCF2603.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF2603.webp" class="img-fluid" alt="An upward perspective view framed between two elevated structural beams or overpasses, revealing a pale blue sky with scattered white clouds. The geometric lines and muted tones emphasize architectural detail and the interplay of light and shadow."></a></p>
<p>The lens certainly has a lot of the vintage lens “character”: I’m still not sure I love the blooming, which is probably what a lot of people describe as “dreamy”. And there are certainly subtle style differences from modern lenses. But the other great thing about “character” is how you interact with it: after using this lens for a while, I learned its quirks, which then became additional dimensions I could tweak in my photography. In terms of how fun it is to shoot, it certainly tops any “modern” lens.</p>
<div>

</div>
<div class="quarto-layout-panel" data-layout-ncol="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="DSCF3902.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3902.webp" class="img-fluid" alt="A low-angle view down a long, warmly lit indoor hallway with polished floors and wooden benches lining the walls. Framed artwork and colorful fabric hangings decorate the walls, and a glass door at the far end reveals a softly focused exit sign and window light."></a></p>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: center;">
<p><a href="DSCF3885.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3885.webp" class="img-fluid" alt="A close-up of a smiling Hotei (Laughing Buddha) statue adorned with a wooden bead necklace, positioned in front of an ornate wooden clock with Roman numerals. The scene is softly lit, emphasizing the serene expression and aged texture of the statue."></a></p>
</div>
</div>
</div>
<p><a href="DSCF3880.webp" class="lightbox" data-gallery="pictures"><img src="https://enyanz.com/posts/takumar-50-f1.4/DSCF3880.webp" class="img-fluid" alt="A small indoor Zen sand garden featuring a carefully balanced stack of smooth dark stones in the foreground. The background shows a softly blurred window view with green plants and diffused natural light, creating a tranquil, contemplative atmosphere."></a></p>
<p>And, did I mention this? $30 for a beautifully built 50mm F1.4 lens with so much history. Having used this lens for a bit now, I’d be perfectly happy paying multiples of that amount. Turning the focus and aperture rings feels so nice I’d buy it even just as a fidget toy.</p>
</section>
<section id="afterwords-yellowing" class="level2 page-columns page-full">
<h2 class="anchored" data-anchor-id="afterwords-yellowing">Afterwords: Yellowing</h2>
<p>When I initially got the lens, its yellowing made it 1300K warmer compared to a modern lens. After leaving it in direct sunlight<sup>4</sup> for a few days, and then under a UV light I found in the basement, the yellowing has been reduced a lot — I’d say it’s a few hundred K at most, and the lens is completely usable without adjusting white balance. Kitamura definitely jumped the gun marking this as junk. Their loss, my gain.</p>


<div class="no-row-height column-margin column-container"><div id="fn4"><p><sup>4</sup>&nbsp;A lesson learned: remove plastic lens caps, as they will melt when getting focused on by the lens under direct sunlight.</p></div></div></section>


 ]]></description>
  <category>photography</category>
  <category>life</category>
  <guid>https://enyanz.com/posts/takumar-50-f1.4/</guid>
  <pubDate>Tue, 01 Jul 2025 04:00:00 GMT</pubDate>
  <media:content url="https://enyanz.com/posts/takumar-50-f1.4/takumar-on-xt30.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Remote development on HPC (Yale’s) clusters with VSCode/Cursor</title>
  <dc:creator>Enyan Zhang</dc:creator>
  <link>https://enyanz.com/posts/ycrc-remote-dev/</link>
  <description><![CDATA[ 




<section id="tldr" class="level2">
<h2 class="anchored" data-anchor-id="tldr">TL;DR</h2>
<p>When using Remote-SSH or a similar tool, you want to start your VSCode server on a compute node. Yale’s cluster, for example, automatically kills VSCode instances on the login node. You can get around this by setting <code>ProxyCommand</code> in your ssh config to ssh twice (first to the login node, then to a compute node) so the server starts there directly.</p>
<p><em>See the solution as well as the extra step for VSCode.</em></p>
<p><a href="https://code.visualstudio.com/docs/remote/tunnels">Remote Tunnels</a> is also a good workaround, but it’s an extra step and doesn’t work if you’re using Cursor because it’s blocked by Microsoft.</p>
</section>
<section id="the-issue" class="level2">
<h2 class="anchored" data-anchor-id="the-issue">The Issue</h2>
<p>The issue with Remote-SSH (apparently) is that the VSCode server can be quite a demanding process, so when you’re using an HPC you should avoid starting it on the login node. Some places (e.g.&nbsp;<a href="https://docs.ccv.brown.edu/oscar/connecting-to-oscar/remote-ide">Brown</a>) have HPC staff set up dedicated VSCode nodes and associated configs, but other places (looking at you, Yale) decide that it’s better to just kill all VSCode processes automatically and suggest that people use <a href="https://docs.ycrc.yale.edu/clusters-at-yale/access/ood-vscode/">alternatives</a>.</p>
<p>If you use VSCode, the best way is probably to use <a href="https://code.visualstudio.com/docs/remote/tunnels">Remote Tunnels</a>, which requires starting a <code>code</code> CLI instance on the compute node. In this case, instead of an ssh connection, both your local client and the remote server talk to Microsoft, which establishes a tunnel for you that is authenticated with your Microsoft/GitHub account. But this has a few problems:</p>
<ul>
<li>It’s just a lot of hassle. The steps (sketched after this list) are:
<ol type="1">
<li>ssh into the login node</li>
<li>start a script</li>
<li>watch the output of that script, which gives you a code to verify your account with Microsoft</li>
<li>open a browser page on your local computer and paste in that code</li>
</ol></li>
<li><strong>Does not</strong> work with Cursor — Microsoft blocked Cursor from using its official extensions, and Cursor’s replacement doesn’t include remote tunnels yet
<ul>
<li>I somehow managed to install the already-blocked Remote Tunnels extension on Cursor on my Mac, but I can’t do it anymore on my Windows machine.</li>
</ul></li>
</ul>
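<p>For reference, steps 2 and 3 look roughly like this on the remote end (a sketch, assuming the standalone <code>code</code> CLI has already been downloaded to the node; flags may vary across versions):</p>
<pre><code># on the compute node: start the tunnel and print a device-login code
./code tunnel --accept-server-license-terms
# paste the printed code at the URL it shows to link your Microsoft/GitHub account</code></pre>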
<p>I spent a lot of time wrestling with this and wanted an easier solution: ideally something that’s as simple as regular Remote-SSH, which only has 1 step: open the window on VSCode.</p>
</section>
<section id="the-solution" class="level2">
<h2 class="anchored" data-anchor-id="the-solution">The Solution</h2>
<section id="pre-requisites" class="level3">
<h3 class="anchored" data-anchor-id="pre-requisites">Pre-Requisites</h3>
<p>You should be able to ssh into the login node of your cluster. At Yale, this requires you to have set up ssh keypairs and the appropriate ssh config; it also has an MFA step via Duo.</p>
</section>
<section id="background" class="level3">
<h3 class="anchored" data-anchor-id="background">Background</h3>
<p>The idea is simple, but automating it takes a little more work. Below is how a typical HPC cluster is structured:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://enyanz.com/posts/ycrc-remote-dev/hpc.svg" class="img-fluid figure-img" style="width:80.0%"></p>
<figcaption>Structure of a typical HPC Cluster</figcaption>
</figure>
</div>
<p>Assuming you know which compute node you want to end up on, you’d set up an ssh config that looks like this:</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb1-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Host</span> grace</span>
<span id="cb1-2">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">HostName</span> grace.ycrc.yale.edu</span>
<span id="cb1-3">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">User</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-netid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Host</span> grace-remote-ssh</span>
<span id="cb1-6">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">User</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-netid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb1-7">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">HostName</span> compute-0001</span>
<span id="cb1-8">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">ProxyJump</span> grace</span></code></pre></div>
<p>and open a Remote SSH window to connect to <code>grace-remote-ssh</code>.</p>
<p>This works because <code>ssh compute-0001</code> from the login node will take you to the compute node, and we specified it to go through <code>grace</code> first. The compute nodes are usually only accessible via ssh from the login node. Furthermore, SLURM usually restricts ssh access to the nodes currently under your allocation. The biggest hurdle to automation here is that nodes are only available after you request them, and the node name changes depending on availability. So you <em>don’t</em> know which node to put in your config.</p>
<p>UW’s recommendation is to use a script that replaces your local config file. But that also seems like a lot of work. The steps would be:</p>
<ol type="1">
<li>SSH into the cluster and start a job (with a particular name)</li>
<li>Run your local script, which SSH’es into the cluster again, finds the node by the job name, and copies it back into your local config</li>
<li>Remote SSH into the compute node</li>
<li>When you’re done, cancel your job request manually</li>
</ol>
<p>Sure, you can put steps 1 and 2 into one script, but that’s still 3 steps.</p>
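<p>A hypothetical version of that helper script might look like this (the job name <code>vscode</code> and the <code>compute-</code> placeholder are illustrative, and the <code>sed -i</code> syntax below is GNU sed):</p>
<pre><code>#!/usr/bin/env bash
# Look up the node running the job named "vscode" on the cluster...
node=$(ssh grace "squeue -u \$USER -n vscode -h -o %N")
# ...then point the grace-remote-ssh entry in ~/.ssh/config at it.
sed -i "s/^\( *HostName \)compute-.*/\1${node}/" ~/.ssh/config</code></pre>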
</section>
<section id="the-1-step-solution" class="level3">
<h3 class="anchored" data-anchor-id="the-1-step-solution">The 1-step Solution</h3>
<p>Now, instead of manually allocating and then connecting, you can bundle those two actions into <strong>one SSH invocation</strong>. VSCode will:</p>
<ol type="1">
<li>SSH to the login node</li>
<li>Invoke <code>salloc</code> to grab a compute node</li>
<li><code>nc</code>-pipe that node’s SSH port back over the same connection</li>
<li>Land you directly on the compute node</li>
</ol>
<p>Simply add this host entry to your <code>~/.ssh/config</code>:</p>
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># This is your login node, it could be any other thing/name</span></span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Host</span> grace</span>
<span id="cb2-3">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">HostName</span> grace.ycrc.yale.edu</span>
<span id="cb2-4">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">User</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-netid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-5"></span>
<span id="cb2-6"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Host</span> ycrc-ondemand</span>
<span id="cb2-7">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">User</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-netid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-8">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">ProxyCommand</span> ssh grace <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"bash -lc 'salloc --nodes=1 --partition=devel --time=4:00:00 --job-name=vscode /bin/bash -c </span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">nc </span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\$</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">SLURM_NODELIST 22</span><span class="dt" style="color: #AD0000;
background-color: null;
font-style: inherit;">\"</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'"</span></span>
<span id="cb2-9">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">ForwardAgent</span> yes</span></code></pre></div>
<div class="callout callout-style-simple callout-note callout-titled" title="How it works">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
How it works
</div>
</div>
<div class="callout-body-container callout-body">
<ul>
<li><code>ssh grace</code> opens the login-node session and prompts you for Duo; once you approve the push,</li>
<li><code>bash -lc 'salloc …'</code> runs in a <em>login</em> shell so <code>salloc</code> (and any module-provided SLURM tools) are available on <code>PATH</code>. You can change the specs of this allocation just like any other <code>salloc</code> command,</li>
<li>as soon as SLURM grants your job, <code>$SLURM_NODELIST</code><sup>1</sup> expands to the real compute-node hostname,</li>
<li><code>nc $SLURM_NODELIST 22</code> pipes that node’s port 22 back through the login host, completing the SSH tunnel to the compute node.</li>
</ul>
</div>
</div>
<p>Once this is in place, <strong>your only step</strong> is:</p>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ssh</span> ycrc-ondemand</span></code></pre></div>
<p>or, in VS Code’s Remote-SSH panel, select <strong>ycrc-ondemand</strong> — and you’ll land straight on your allocated compute node. No extra scripts, no manual edits, and no VS Code processes on the login node.</p>
</section>
<section id="the-caveat-mfa" class="level3">
<h3 class="anchored" data-anchor-id="the-caveat-mfa">The Caveat: MFA</h3>
<p>Yale’s cluster requires MFA on every login. It’s done from an interactive terminal like this:</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-netid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span>@grace.ycrc.yale.edu<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Duo</span> two-factor login for <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>your-netid<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Enter</span> a passcode or select one of the following options:</span>
<span id="cb4-4"></span>
<span id="cb4-5"> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">1.</span> Duo Push to XXX-XXX-XXXX</span>
<span id="cb4-6"> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">2.</span> Phone call to XXX-XXX-XXX</span></code></pre></div>
<p>At which point you need to enter <code>1↩︎</code>. On Cursor this is a non-issue because the default Remote-SSH behavior is to loop this back into an interactive prompt. But in VSCode the default behavior is to stream it to Outputs. So there’s an extra step:</p>
<ol type="1">
<li><p>Open Settings</p></li>
<li><p>Search for Remote-SSH: Show Login Terminal and set it to true</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode json code-with-copy"><code class="sourceCode json"><span id="cb5-1"><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">"remote.SSH.showLoginTerminal":</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">true</span></span></code></pre></div></li>
</ol>
<p>Once enabled, VSCode will open a new terminal pane when you connect; type 1, press Enter, and then approve the push on your device.</p>
<p>What’s also great about this approach is that once you close your client, the remote will also know (since it’s interactive) and will automatically relinquish the job allocation.</p>
</section>
</section>
<section id="theres-more" class="level2">
<h2 class="anchored" data-anchor-id="theres-more">There’s more?</h2>
<p>I also attempted to write a much more complicated script that allocates a new session when the current job is close to ending. That part is not hard; the harder part is maintaining the same connection and knowing when the client has disconnected. I think keeping the same connection would require a custom reverse proxy that’s always on the same port, but I couldn’t get this to work. You should tell me if you manage to do it!</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>This gives you the node list of the <em>current job</em> from the job allocation itself. E.g. if you’re on requested an interactive job and got <code>node001</code>, it’ll give <code>node001</code> within that interactive terminal.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>code</category>
  <category>ai</category>
  <guid>https://enyanz.com/posts/ycrc-remote-dev/</guid>
  <pubDate>Sun, 29 Jun 2025 04:00:00 GMT</pubDate>
  <media:content url="https://enyanz.com/posts/ycrc-remote-dev/thumbnail.svg" medium="image" type="image/svg+xml"/>
</item>
<item>
  <title>Cline review</title>
  <dc:creator>Enyan Zhang</dc:creator>
  <dc:creator>Claude 3.5 Sonnet</dc:creator>
  <dc:creator>DeepSeek R1</dc:creator>
  <link>https://enyanz.com/posts/cline-review/</link>
  <description><![CDATA[ 




<div class="callout callout-style-simple callout-warning callout-titled" title="Notice &amp; Disclaimer: AI Generated Content">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Notice &amp; Disclaimer: AI Generated Content
</div>
</div>
<div class="callout-body-container callout-body">
<p>This post was initially generated by a language model, usually by summarizing a human conversation or expanding a human-written summary. The goal is not to populate the internet with yet another piece of uncalled-for, AI-generated slop (in which, unfortunately, people working in AI are complicit). Rather, it is to enable lower friction in sharing and distilling information. I have worked on, and often significantly rewritten, the post to ensure it accurately reflects the underlying human intentions and experiences, but inaccuracies and biases may remain.</p>
</div>
</div>
<section id="update-june-29-2025" class="level2">
<h2 class="anchored" data-anchor-id="update-june-29-2025">Update June 29 2025</h2>
<p>The situation has changed a lot since I initially tried out Cline (and agentic AI coding tools more generally). The first thing, of course, is that success rates have massively improved with new models/tools. There are many more complex tasks I now feel comfortable relying on LMs for, and the easy “gotchas” happen less often.</p>
<p>At the same time, I don’t think the big picture has changed: if you have an important project, you probably don’t want a codebase you don’t fully understand, one you can only scrap when things stop working or LMs stop being able to work on it. Keep the important things in your control!</p>
<p>But I can also confidently say that the Claude Code/Codex/Gemini CLIs offer an easier-to-set-up and more customizable experience. And if you’re starting out, agent mode in VSCode’s Copilot and Cursor is more user friendly, and also has the benefit of not being pay-per-use (Cline’s bill shoots up extremely quickly due to how much context it collects!). Since all of them basically use the same suite of models as backends, ultimately the choice is one of whose UX is better, which changes rather quickly, and whose prompts best suit your use cases. I’m guessing that there’s an intricate balance between telling the model too little, so it makes the same obvious mistakes, and telling it too much, so its operating structure is too rigid, but that’s just a guess from an outsider’s perspective.</p>
</section>
<section id="tldr" class="level2">
<h2 class="anchored" data-anchor-id="tldr">TL;DR</h2>
<p><a href="https://cline.bot/">Cline</a> is a VSCode extension offering a bring-your-own-key alternative to GitHub Copilot, with the ability to execute commands and plan multi-step edits. While promising in concept, its high token usage and several UX limitations make it difficult to recommend over Copilot Edit (as of Feb 2025). It’s also not as “smart” as you think it might be<sup>1</sup>. Cost per coding session can range from $0.5-3 depending on your model choice.</p>
<p>Also see the verdict section.</p>
</section>
<section id="overview" class="level2">
<h2 class="anchored" data-anchor-id="overview">Overview</h2>
<p>Cline operates as a chatbot-style interface within VS Code, capable of code generation, modification, and terminal command execution (which sounds more promising than it actually is). The default plan is completely free, with the only cost being your LM API calls.</p>
</section>
<section id="core-functionality" class="level2">
<h2 class="anchored" data-anchor-id="core-functionality">Core Functionality</h2>
<p>Unlike Copilot’s real-time completions, Cline works in a turn-based manner similar to Copilot Edit, where you request specific changes or additions and the AI responds with complete code snippets or modifications. The two most important features are the internal feedback loop and more generous access: Cline can execute code changes in steps following its own plan, and it can modify files/execute commands on your computer.</p>
</section>
<section id="pros" class="level2">
<h2 class="anchored" data-anchor-id="pros">Pros</h2>
<ol type="1">
<li>Bring-your-own-key! Use any LM and provider you want</li>
<li>Cline has a “plan” mode, in which it gathers information and makes a plan</li>
<li>Can request access to files/execute terminal commands</li>
<li>Offers checkpoint features for reverting changes</li>
</ol>
</section>
<section id="cons" class="level2">
<h2 class="anchored" data-anchor-id="cons">Cons</h2>
<ol type="1">
<li>Cline determines when a task is complete, not you. Once it declares the task complete, it’s done. I find this really weird.</li>
<li>Very token-consuming: the first request is often 10k+ tokens, and hitting the context limit is realistic. Each session can be $0.5–3 depending on your model, so expect to spend more than with Copilot/Cursor if you let it run by itself.</li>
<li>No effective code verification: Cline can, in principle, run commands and check outputs, but it doesn’t do so reliably or use command outputs productively.
<ol type="i">
<li>An example: I start a task telling Cline how to verify success (run the script with tests in it). Cline executes the command, and without checking the outputs, immediately declares the task is complete.</li>
<li>In general it feels much like vanilla AI autocomplete: once Cline generates a plan, it executes it step-by-step, without verifying after each step or re-planning. Think about whether your initial plan for a coding project ever worked out completely!<sup>2</sup></li>
</ol></li>
<li>Cannot revert to checkpoints before AI modifications (as of Feb 2025). This could be a really simple fix, but they don’t yet have it. You’d better have another copy/commit before Cline starts working on your code.</li>
<li>Each session has its own context, so Cline always starts by gathering information. This can be frustrating if your codebase is complicated.</li>
<li>Doesn’t feel as polished as Copilot Edit</li>
</ol>
</section>
<section id="verdict" class="level2">
<h2 class="anchored" data-anchor-id="verdict">Verdict</h2>
<p>While Cline offers flexibility through custom API keys, it doesn’t eliminate what I think is the biggest bottleneck in coding — your thinking speed. It’s not reliable enough for you to only care about the high-level functions/designs<sup>3</sup>, so you still have to be in the loop, understand every line of code, and tell it specifically what to do. If you treat it as a human capable of executing on your high-level goals, you will be thoroughly disappointed. But if you treat it like a multi-turn Copilot Edit, it’s not too bad and can definitely be a productivity tool.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Especially given the model we use is <a href="https://x.com/kimmonismus/status/1889732702795940272">ranked 18th</a> in the world in programming. Yes, I tried O1. Yes, I tried DeepSeek R1. As of Feb 2025, you can’t code hands-off yet.↩︎</p></li>
<li id="fn2"><p>Spoiler alert: these executions don’t often work.↩︎</p></li>
<li id="fn3"><p>My general impression of what works/what doesn’t work in AI coding: describing only the high level input-output behavior equals diaster. Giving pseudocode or a complete description of the implementation works (and saves you a lot of time), but usually you need to debug it yourself.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>code</category>
  <category>ai</category>
  <category>ai-coding</category>
  <category>lm-written</category>
  <guid>https://enyanz.com/posts/cline-review/</guid>
  <pubDate>Fri, 14 Feb 2025 05:00:00 GMT</pubDate>
  <media:content url="https://enyanz.com/posts/cline-review/thumbnail.webp" medium="image" type="image/webp"/>
</item>
<item>
  <title>Applying to graduate school</title>
  <dc:creator>Enyan Zhang</dc:creator>
  <link>https://enyanz.com/posts/grad-school/</link>
  <description><![CDATA[ 




<section id="the-meta-story" class="level2">
<h2 class="anchored" data-anchor-id="the-meta-story">The Meta-story</h2>
<blockquote class="blockquote">
<p>Meta<br>
<em>adjective</em>: referring to itself or to the conventions of its genre; self-referential. <sup>1</sup></p>
</blockquote>
<p>I don’t think there was, as one would like to imagine, a “moment of revelation” when I decided that I wanted to apply to grad school. Unlike some other stories you’d hear, it wasn’t that I had always thought I would one day go to grad school either. My opinion is that the real reason we, as humans, decide to do something is often much more complicated (and opaque) than the story we tell others, tell ourselves, and eventually convince ourselves is the case.</p>
<p>In the same vein, there are many people who write posts that give <em>advice</em> on applying to grad school. I owe much thanks to their work — see “Other Links” on the right for some I felt were really helpful. But I also cannot help but feel that advice is often too <em>distilled</em>: the advice-giver reflects on their experiences, thinks about the larger picture, and summarizes them into advice. The downside of this process is that a lot of detail is lost along the way, and inferring the intended situation in which a piece of advice is applicable is quite non-trivial. <sup>2</sup></p>
<p>Given the wealth of online resources these days, especially for applying to graduate schools in CS, I feel that the most helpful thing for me to do is to fill in the void of such lost details — so instead of giving advice from the unqualified position of a junior graduate student, I’ll try to write about my experiences: how I started doing research, what my application season was like, etc. Hopefully the experience can be the medium of implicit advice, from which you get to decide what to take away.</p>
</section>
<section id="computer-science" class="level2">
<h2 class="anchored" data-anchor-id="computer-science">Computer Science?</h2>
<p>I learned some programming — which amounted to <code>def</code>, <code>return</code>, <code>for</code>, <code>if</code>, and <code>else</code> in Python — in high school, but nowhere near seriously. In fact, I started college as a mechanical engineering major, and did it (mostly) for my two years at Rutgers. It wasn’t very fun. Most engineering programs in the US share a common core during the first 2 years, which covers the basics in a broad range of STEM topics. I think it’s because of <a href="https://www.abet.org/accreditation">ABET accreditation</a>. If you’re an engineering major who wants to “build stuff”, you’ll be fairly disappointed during these 2 years.</p>
<p>The final straw (I think, in retrospect) was a mechanical engineering internship I did. I can’t complain much about the company or the projects I was tasked to do — I had an overall fairly positive experience — but there’s also the feeling that traditional mechanical engineering companies are simply not where “things happen”. It did not feel like a career I wanted to have.</p>
<p>Computer science, in contrast, is indeed where “things happen”. I was transferring to Brown the semester after the internship, so it was a good excuse to start something new. I did the necessary placements, and took intro to CS and deep learning<sup>3</sup> (a very weird combination) in the first semester of my junior year.</p>
</section>
<section id="and-language" class="level2">
<h2 class="anchored" data-anchor-id="and-language">… and Language?</h2>
<p>Interestingly, I think what was most important to the start of my current research was the <em>humanities electives</em> required by ABET: during my first year at Rutgers, I took and immensely enjoyed 2 philosophy courses, logic and philosophy of language. Philosophy of language, in particular, opened a new world for me. I was fascinated by the project of introspectively characterizing our linguistic capabilities. Taking the (very well given!) advice of the need to have some STEM-humanities balance, I took a philosophy of language seminar, Sense and Reference<sup>4</sup>, the same semester as my first CS courses. It struck me that artificial systems — GPT-3 had been out for 2 years at that point — did not satisfy most of the assumptions we use when analyzing linguistic creatures, yet they seemed to master language so well.</p>
<p>The same semester, driven by the fear of not having something to do during my junior year summer, I started looking for research opportunities. One such attempt was going to an ask-me-anything session hosted by <a href="https://cs.brown.edu/people/epavlick/">Ellie</a>, during which I unloaded my philosophy of language questions about language models onto her. I enjoyed that a lot, and started dropping in to the lab meetings.</p>
<p>After a while, someone in Ellie’s lab asked for help on a new project. I volunteered<sup>5</sup>, and started working on it. Getting the first toy model to train took 3 months. I got a research assistantship during the summer, and getting the first proof-of-concept to work took another 2 months. I was lucky, though: when the start of my senior year approached, I had a project that was taking shape. The <a href="https://arxiv.org/abs/2310.10899">project</a> was a great reflection of my interests: a union of philosophical questions about artificial and natural intelligence and technical neural network research.</p>
</section>
<section id="applying" class="level2">
<h2 class="anchored" data-anchor-id="applying">Applying</h2>
<p>Much like everyone else’s research projects, there was a lot of head-scratching and a lot of frustration involved in my first project. But the sense of fulfillment of finally getting something done, and more importantly, getting something <em>new</em>, something <em>I cared about</em> done led me to think that maybe research could be a career past graduation. I also realized that I had what’s minimally required to apply to grad school: I liked what I did, so I could apply to do similar things and use my experience to back up the application.</p>
<p>In late September 2023, I decided to apply: I wasn’t confident about getting offers<sup>6</sup>, but like a lot of other things in life, the cost of failing was sufficiently low that it’d be foolish not to try. I asked my advisor what programs/people she would recommend, added authors of papers I had read and admired, and built a list. I only included schools I <em>actually wanted to go to</em> (which is against best practices given by <a href="https://matt.might.net/articles/how-to-apply-and-get-in-to-graduate-school-in-science-mathematics-engineering-or-computer-science/">this article</a>, for example, if what you want is to have <em>a</em> place to go). As a remedial strategy, I decided that I would also apply to full-time jobs. It’s nice that the timelines are somewhat separated — deadlines for PhD programs are usually in December, when companies slow down interviewing/hiring, and the biggest recruiting season is early fall, before grad school applications start. This choice added significant work, and I did not get a full-time offer (mostly due to my disastrous technical interviews), so I can’t comment on how advisable this strategy is. But I think it’s worth considering.</p>
<p>I also chatted with some graduate students about their experiences: it would seem that at Brown, people are in general fairly happy. No horror stories of unbearable pressure or terrible advisors, and many recommend it as a chance to do something you believe matters. That resonated with me a lot: I still think doing a PhD is what maximizes your chances of doing something you care about and think matters.</p>
<p>Despite being a habitual procrastinator (I wrote my Brown transfer essay 3 hours before the deadline!), I set a hard deadline for my SoP: first draft before the first day of November, and miraculously met it. I then went through a few revisions, mostly asking for comments from friends, asked around for recommendation letters<sup>7</sup>, and submitted my applications during finals week.</p>
</section>
<section id="post-application" class="level2">
<h2 class="anchored" data-anchor-id="post-application">Post Application</h2>
<p>There was a lot of anxiety after I turned in my applications: every email notification would make me jump, and I checked GradCafe multiple times a day. I quickly found out that I was suffering from too much information: if a school I applied to had updates on GradCafe, I would start wondering whether that meant I’d get rejected. In reality (and I think we all know this), there are just way too many variables, and one can’t reliably predict what’s going on behind the scenes. I stopped polling GradCafe and social media sites for updates in January, and my anxiety lessened significantly.</p>
<p>I eventually started getting interviews — they can happen at any time, but mine turned out to be early — and enjoyed chatting with the professors who interviewed me. It felt a lot better than job interviews. Instead of being interrogated for technical details, my interviews were more like research chats<sup>8</sup>. If there’s a shared passion/niche in research, the chat usually goes quite well. I also tried selling people my project ideas.</p>
<p>Then in February and March, I went on school visits. I really enjoyed talking to people: grad students, professors, other visitors. And on a range of topics: my research, their research, where the field is going, where my (their) life is going, how my (their) life currently is. All of the trips were paid for, which is a nice cherry on top as well. I also found keeping notes very helpful: these chats are much more effective if you keep a list of questions, as well as whom to ask. For example, you might want to ask for comments on your potential advisor from their students, other students in the department, their collaborators, and themselves. It’s also very useful to keep notes of their answers — maybe do that when you come back to the hotel at the end of the day. There are details that easily get lost if you don’t note them down: potential collaborators, papers to check out, bureaucratic red tape, etc. I personally found the notes really useful when making decisions.</p>
<p>But beyond obtaining information, I think you get a better feeling of what it’s <em>like</em><sup>9</sup> to be there from a school visit. The more you ask and experience, the clearer your picture is going to be — and I think that’s the best way of deciding where to go: imagining yourself at each of those places, which one do you like more?</p>
<p>I talked to a lot of people (maybe too many) while trying to make the decision. In retrospect, the choice was pretty clear to begin with; I was just indecisive and wanted the best of both worlds. I think this is the case for many important decisions we make: if you picture yourself after choosing each of the different options, the choice becomes a lot clearer. I sent response emails in early April, and accepted my offer officially — the official deadline is April 15, and I don’t think you should feel pressured to decide early, but it’s also good to accept as soon as you have decided.</p>
</section>
<section id="afterwords" class="level2">
<h2 class="anchored" data-anchor-id="afterwords">Afterwords</h2>
<p>I am currently writing this blog post in my apartment, half a year after coming to Yale. I am, so far, enjoying my grad school life. I think there was a lot of luck (and privilege) involved in my application process, and we shouldn’t try to summarize too much from past examples. But in light of all the uncertainties, it’s even more important that one explore — had I not taken philosophy of language, I wouldn’t be here now — and attempt — had I decided too early that I wasn’t qualified, for research or for grad school, I also wouldn’t be here. I consider myself an unlikely and atypical applicant, and I suppose there are two sides of this story: part of the hope of writing this post is to give some other atypical applicant like myself a reference data point, but the other part is that not having a reference data point doesn’t mean something shouldn’t be attempted.</p>
<p>You can find my SoP <a href="yale_sop.pdf">here</a>. I’m also happy to chat more.</p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Google English Dictionary, provided by Oxford Languages↩︎</p></li>
<li id="fn2"><p>There’s the saying that “for any non-trivial piece of advice, the opposite is often also true”. I don’t know what the source is, but I deeply feel that this is the case.↩︎</p></li>
<li id="fn3"><p>Oweing to having double majored statistics at Rutgers, I actually did have the background needed for deep learning.↩︎</p></li>
<li id="fn4"><p>The course gets its name from <a href="https://en.wikipedia.org/wiki/Sense_and_reference">Frege</a>, and it’s also an <a href="https://www.frege.org/phil1860/course_desc.php">amazing course</a>.↩︎</p></li>
<li id="fn5"><p>Without knowing Pytorch (I only learned Tensorflow) or Huggingface↩︎</p></li>
<li id="fn6"><p>One main worry was the fact that I had only started CS a year ago, and never did the classic sequence of requirements. That turned out to be less important that I anticipated.↩︎</p></li>
<li id="fn7"><p>A huge headache and stress factor, especially if your recommender is not very responsive.↩︎</p></li>
<li id="fn8"><p>I’ve seen some people prepare presentation slides about their research projects. I think this can be helpful, but if the default is to “chat about research”, my opinion is that presenting with slides actually kills the atmosphere and may lead to people fixating on technical details↩︎</p></li>
<li id="fn9"><p>How your life will be <em>like</em> is very much a type of qualia: borrowing classic philosophical theories, we can’t know this unless we <em>lived</em> it, and visits give you a good taste.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>research</category>
  <category>life</category>
  <guid>https://enyanz.com/posts/grad-school/</guid>
  <pubDate>Sat, 01 Feb 2025 05:00:00 GMT</pubDate>
  <media:content url="https://enyanz.com/posts/grad-school/cit.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>RNNs, Huggingface Trainer, and PackedSequence’s</title>
  <dc:creator>Enyan Zhang</dc:creator>
  <link>https://enyanz.com/posts/trainer-packed-sequence/</link>
  <description><![CDATA[ 




<section id="tldr" class="level2">
<h2 class="anchored" data-anchor-id="tldr">TL;DR</h2>
<section id="the-issue" class="level3">
<h3 class="anchored" data-anchor-id="the-issue">The Issue</h3>
<p>When training with the Huggingface <code>Trainer</code>, if your data collator (<code>data_collator</code> in <code>Trainer</code>, or <code>collate_fn</code> for a PyTorch <code>DataLoader</code>) outputs a <code>PackedSequence</code> for training a recurrent model (RNN/LSTM/GRU/who knows), there will be an assertion error <code>assert isinstance(data, (list, tuple)) and len(data) == 2</code> triggered by line 254 of <code>torch/nn/utils/rnn.py</code>.</p>
</section>
<section id="the-solution" class="level3">
<h3 class="anchored" data-anchor-id="the-solution">The Solution</h3>
<p>Huggingface’s trainer sends the <code>PackedSequence</code> to the target device (e.g.&nbsp;GPU) incorrectly; you need to override one method. See this section for the code.</p>
</section>
<section id="an-even-better-way" class="level3">
<h3 class="anchored" data-anchor-id="an-even-better-way">An Even Better Way</h3>
<p>See the afterword. This issue is completely avoidable if you define your model class differently.</p>
</section>
</section>
<section id="full-story" class="level2">
<h2 class="anchored" data-anchor-id="full-story">Full Story</h2>
<section id="background" class="level3">
<h3 class="anchored" data-anchor-id="background">Background</h3>
<p>Recently I was training toy RNNs for a project. Writing a <code>train</code> function with a <code>for epoch in range(epochs)</code> in 2024 felt very wrong (and unnecessary), so I thought about making everything work with the <code>Trainer</code> of <a href="https://huggingface.co/docs/transformers/en/index">Huggingface Transformers</a>. There are many good reasons for doing so (and it was a huge quality-of-life improvement!); I’ll list a few I’ve already used (and that worked pretty much out of the box):</p>
<ul>
<li>saving/loading models with a one-liner</li>
<li>adding/changing learning rate schedules</li>
<li>generating with <code>.generate()</code></li>
<li>doing simple hyperparameter sweeps (see <a href="https://huggingface.co/docs/transformers/en/hpo_train">Hyperparameter Search</a>)</li>
</ul>
<p>But things don’t always work, and when they don’t, debugging <code>Trainer</code> is frustrating — it does too many things, and many of those things rely on heuristics. Below is an incomplete list of issues I’ve come across (and still remember debugging):</p>
<ul>
<li>it assumes the training target is a dict entry called <code>label</code> or <code>labels</code>, and will skip <code>evaluate()</code> otherwise — but it won’t skip the eval loop entirely; instead it will only return eval metainfo such as runtime. The solution is to specify <code>label_names</code> in <code>TrainingArguments</code> (see the snippet after this list).</li>
<li>it sends tensors to the model’s device by recursively iterating all inputs until it reaches the basic data elements (which should normally be some <code>Tensor</code>), but the heuristic for stopping this recursion is <code>hasattr(data, "to")</code> (see <a href="https://github.com/huggingface/accelerate/blob/03153658f4165206e3a18e8c1d668ec3d6592ed0/src/accelerate/utils/operations.py#L148">source</a>) — so if you define a class that contains your custom data, it absolutely cannot have a <code>to</code> method that does something else.</li>
</ul>
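<p>For the first gotcha, the fix is a one-liner. Here’s a minimal sketch, assuming (hypothetically) that your dataset stores the training target under a column called <code>target</code>:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">from transformers import TrainingArguments

# assumption: the target lives under "target" rather than "labels", so
# Trainer has to be told explicitly which dict entries are targets
training_args = TrainingArguments(
    output_dir="out",
    label_names=["target"],
)</code></pre></div>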
<p>And unfortunately one such heuristic breaks Pytorch RNNs. Here’s the premise:</p>
<ol type="1">
<li>Transformers deal with variable-length sequences by padding inputs
<ol type="1">
<li>This is usually done by a <code>DataCollator</code>, which gets a list of dicts and returns a dict of collated tensors (the act of “creating a batch” from samples)</li>
<li>Additionally, <code>attention_mask</code> helps the model zero out attention on padding tokens, so effectively the model does not “see” the padded tokens</li>
</ol></li>
<li>RNNs also need to deal with variable-length input sequences
<ol type="1">
<li>It’s best if we also delegate this task to a data-collating function</li>
<li>But RNNs can’t deal with padding! There’s no trivial parallel to something like <code>attention_mask</code>, especially because Pytorch RNNs are called with the entire sequence at once, as opposed to manually “unrolling” the model.</li>
</ol></li>
</ol>
<p>The solution to the above problem is to use a <a href="https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.PackedSequence.html"><code>PackedSequence</code></a>. The underlying idea is quite simple: instead of viewing the input as a batch of sequences, view it as a sequence of batches, where each batch can have a different batch size. The figure below illustrates it quite well<sup>1</sup>:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://enyanz.com/posts/trainer-packed-sequence/packed_sequence.jpg" class="img-fluid figure-img"></p>
<figcaption>A Visual Illustration of <code>PackedSequence</code> from <a href="https://github.com/sgrvinod/">@sgrvinod</a></figcaption>
</figure>
</div>
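<p>To make the “sequence of batches” view concrete, here’s a minimal sketch with a toy batch of two sequences (assuming 0 is the padding id):</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python">import torch
from torch.nn.utils.rnn import pack_padded_sequence

# two padded sequences, of true lengths 3 and 1
padded = torch.tensor([[5, 6, 7],
                       [8, 0, 0]])
packed = pack_padded_sequence(padded, torch.tensor([3, 1]),
                              batch_first=True, enforce_sorted=False)
print(packed.data)         # tensor([5, 8, 6, 7]): time-major, no padding
print(packed.batch_sizes)  # tensor([2, 1, 1]): batch size per time step</code></pre></div>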
<p>So the solution seems simple enough: we just need to define a data collating function that creates a <code>PackedSequence</code> from a list of samples, like the one below, and life’s good, right?</p>
<div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> torch.nn.utils.rnn <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pack_padded_sequence</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> collate_fn(examples):</span>
<span id="cb1-4">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># first collate, e.g. using torch.stack</span></span>
<span id="cb1-5">  examples <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {k: torch.stack([e[k] <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> e <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> examples]) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> examples[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]}</span>
<span id="cb1-6"></span>
<span id="cb1-7">  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># assume you previously tokenized the input with a transformers tokenizer</span></span>
<span id="cb1-8">  input_lengths <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>(tokenized_input[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"attention_mask"</span>], dim<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>)</span>
<span id="cb1-9">  examples[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input_ids"</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pack_padded_sequence(</span>
<span id="cb1-10">    examples[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"input_ids"</span>], </span>
<span id="cb1-11">    input_lengths, </span>
<span id="cb1-12">    batch_first<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, </span>
<span id="cb1-13">    encforce_sorted<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span></span>
<span id="cb1-14">    )</span>
<span id="cb1-15"></span>
<span id="cb1-16">  <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> examples</span></code></pre></div>
</section>
<section id="why-trainer-cannot-process-packedsequences" class="level3">
<h3 class="anchored" data-anchor-id="why-trainer-cannot-process-packedsequences">Why <code>Trainer</code> cannot process <code>PackedSequence</code>’s</h3>
<p>If only life were so easy — I invite you to re-read the title of this post and realize that we’ve only just gotten to the issue. If you try <code>trainer.train()</code> with a <code>collate_fn</code> like the above, you will get the following cryptic error message:</p>
<div class="bash-output">
<div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb2-1">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">0%</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span>                                                    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">0/12520</span> [00:00<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span><span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">?</span>, <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">?</span>it/s]</span>
<span id="cb2-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Traceback</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">most</span> recent call last<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">:</span></span>
<span id="cb2-3">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/src/train.py"</span>, line 243, in <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>module<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-4">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">main()</span></span>
<span id="cb2-5">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/src/train.py"</span>, line 173, in main</span>
<span id="cb2-6">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">trainer.train()</span></span>
<span id="cb2-7">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 2123, in train</span>
<span id="cb2-8">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">inner_training_loop</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb2-9">           <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-10">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 2481, in _inner_training_loop</span>
<span id="cb2-11">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">tr_loss_step</span> = self.training_step<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">model,</span> inputs, num_items_in_batch<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb2-12">                   <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-13">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 3573, in training_step</span>
<span id="cb2-14">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">inputs</span> = self._prepare_inputs<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">inputs</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb2-15">             <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-16">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 3520, in _prepare_inputs</span>
<span id="cb2-17">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">inputs</span> = self._prepare_input<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">inputs</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb2-18">             <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-19">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 3502, in _prepare_input</span>
<span id="cb2-20">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">data</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">{k:</span> self._prepare_input<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">v</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">,</span> v in data.items<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">}</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb2-21">                      <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-22">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 3502, in <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>dictcomp<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span></span>
<span id="cb2-23">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">data</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">{k:</span> self._prepare_input<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">v</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k<span class="ex" style="color: null;
background-color: null;
font-style: inherit;">,</span> v in data.items<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">}</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb2-24">                          <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-25">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/transformers/trainer.py"</span>, line 3504, in _prepare_input</span>
<span id="cb2-26">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">data</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">self._prepare_input</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">v</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> data<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb2-27">           <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-28">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/torch/nn/utils/rnn.py"</span>, line 93, in __new__</span>
<span id="cb2-29">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">*_packed_sequence_init_args</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span></span>
<span id="cb2-30">     <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^^^^^^^^^^^^^^</span></span>
<span id="cb2-31">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">File</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/&lt;project-dir&gt;/.venv/lib64/python3.11/site-packages/torch/nn/utils/rnn.py"</span>, line 254, in _packed_sequence_init_args</span>
<span id="cb2-32">    <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">assert</span> isinstance<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">data,</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">list,</span> tuple<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">))</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">and</span> len<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">data</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">==</span> 2</span>
<span id="cb2-33">                                               <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">^^^^^^^^^^^^^^</span></span>
<span id="cb2-34"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">AssertionError:</span> </span>
<span id="cb2-35">  <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">In</span> call to configurable <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'main'</span> <span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span>function <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">main</span> at 0x148164687240<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span></code></pre></div>
</div>
<div style="height: 4ex;">

</div>
<p>What happened?? If you look at the call stack at this point, it’s roughly the following:</p>
<ol type="1">
<li><code>Trainer</code> dispatches a batch (list of examples) to our collator</li>
<li>Collator does its job, returning a <code>dict</code> where the value corresponding to <code>input_ids</code> is a <code>PackedSequence</code></li>
<li>The collated batch (now one <code>dict</code>) gets sent to <code>_prepare_inputs</code>, which then sends the batch to <a href="https://github.com/huggingface/transformers/blob/62db3e6ed67a74cc1ed1436acd9973915c0a4475/src/transformers/trainer.py#L3516"><code>_prepare_input</code></a> to map the inputs on the right devices</li>
<li>Since the collated batch can have arbitrary nesting (think a dict of lists of tensors), <code>_prepare_input</code> recursively calls itself until it reaches the bottom level — tensors — and puts them on the right device. See below:</li>
</ol>
<div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _prepare_input(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, data: Union[torch.Tensor, Any]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Union[torch.Tensor, Any]:</span>
<span id="cb3-2">    <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">"""</span></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    Prepares one `data` before feeding it to the model, be it a tensor or a nested list/dictionary of tensors.</span></span>
<span id="cb3-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">    """</span></span>
<span id="cb3-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(data, Mapping):</span>
<span id="cb3-6">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(data)({k: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._prepare_input(v) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> k, v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> data.items()})</span>
<span id="cb3-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(data, (<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">tuple</span>, <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">list</span>)):</span>
<span id="cb3-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">type</span>(data)(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._prepare_input(v) <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> v <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> data)</span>
<span id="cb3-9">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">elif</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(data, torch.Tensor):</span>
<span id="cb3-10">        kwargs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"device"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.args.device}</span>
<span id="cb3-11">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.is_deepspeed_enabled <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">and</span> (torch.is_floating_point(data) <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">or</span> torch.is_complex(data)):</span>
<span id="cb3-12">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># NLP models inputs are int/uint and those get adjusted to the right dtype of the</span></span>
<span id="cb3-13">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># embedding. Other models such as wav2vec2's inputs are already float and thus</span></span>
<span id="cb3-14">            <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># may need special handling to match the dtypes of the model</span></span>
<span id="cb3-15">            kwargs.update({<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dtype"</span>: <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.accelerator.state.deepspeed_plugin.hf_ds_config.dtype()})</span>
<span id="cb3-16">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> data.to(<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>kwargs)</span>
<span id="cb3-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> data</span></code></pre></div>
<p>If you look at the error message, <code>PackedSequence</code>’s constructor here is complaining that it didn’t get enough arguments: there need to be at least two, the padded tensor and the lengths of each example. If you use a debugger, you’ll also find that the <code>data</code> being passed here is only one tensor. Why?</p>
<p>It turns out, <code>PackedSequence</code> inherits from <code>NamedTuple</code>, which in turn is a <code>tuple</code>!</p>
<div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode bash code-with-copy"><code class="sourceCode bash"><span id="cb4-1"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">$</span> python</span>
<span id="cb4-2"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Python</span> 3.12.7 <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">packaged</span> by conda-forge <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">|</span> <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">main,</span> Oct  4 2024, 15:57:01<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span> <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">[Clang</span> 17.0.6 ] on darwin</span>
<span id="cb4-3"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">Type</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"help"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"copyright"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"credits"</span> or <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"license"</span> for more information.</span>
<span id="cb4-4"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> import <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">torch</span></span>
<span id="cb4-5"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> from <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">torch.nn.utils.rnn</span> import PackedSequence</span>
<span id="cb4-6"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> a <span class="ex" style="color: null;
background-color: null;
font-style: inherit;">=</span> PackedSequence<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">torch.tensor</span><span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">[[1,</span> 2], [1, 1]]<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">,</span> torch.tensor<span class="er" style="color: #AD0000;
background-color: null;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">[1,</span> 2]<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">))</span></span>
<span id="cb4-7"><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;&gt;&gt;</span> isinstance<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">(</span><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">a,</span> tuple<span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">)</span></span>
<span id="cb4-8"><span class="ex" style="color: null;
background-color: null;
font-style: inherit;">True</span></span></code></pre></div>
<p>So in the second <code>elif</code> of <code>_prepare_input</code>, the Huggingface trainer incorrectly iterates over it, thinking it’s a list of some sort, and then proceeds to attempt to instantiate a new <code>PackedSequence</code> from the pieces. All this fuss because of a slightly wrong heuristic.</p>
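<p>You can watch the heuristic misfire yourself: iterating over the <code>PackedSequence</code> <code>a</code> from the snippet above simply yields its four named fields, which is why the tuple branch of <code>_prepare_input</code> tries to rebuild a <code>PackedSequence</code> from them and trips the assertion:</p>
<div class="sourceCode" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"># continuing with the PackedSequence `a` constructed above
print([type(field).__name__ for field in a])
# ['Tensor', 'Tensor', 'NoneType', 'NoneType']</code></pre></div>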
</section>
<section id="fixing-the-issue" class="level3">
<h3 class="anchored" data-anchor-id="fixing-the-issue">Fixing the issue</h3>
<p>Fixing the problem once we know what happened is fairly easy: a specific problem calls for a specific solution. Just define a mixin for <code>Trainer</code> classes that overrides the default behavior when the data is a <code>PackedSequence</code>, and subsequently define new <code>Trainer</code>’s that inherit from the mixin.</p>
<p>If you have this exact issue, adding the code block below should be a simple fix (notice that it replaces <code>Trainer</code> and <code>Seq2SeqTrainer</code> by subclassing them).</p>
<div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> transformers <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> (</span>
<span id="cb5-2">    Seq2SeqTrainer <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> HFSeq2SeqTrainer,  </span>
<span id="cb5-3">    Trainer <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> HFTrainer,</span>
<span id="cb5-4">)</span>
<span id="cb5-5"></span>
<span id="cb5-6"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> PrepareInputMixin:</span>
<span id="cb5-7">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> _prepare_input(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>, data: Union[torch.Tensor, Any]) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> Union[torch.Tensor, Any]:</span>
<span id="cb5-8">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">isinstance</span>(data, PackedSequence):</span>
<span id="cb5-9">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> PackedSequence(<span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>._prepare_input(data.data), data.batch_sizes, data.sorted_indices, data.unsorted_indices)</span>
<span id="cb5-10">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-11">            <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">super</span>()._prepare_input(data)</span>
<span id="cb5-12"></span>
<span id="cb5-13"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Seq2SeqTrainer(PrepareInputMixin, HFSeq2SeqTrainer):</span>
<span id="cb5-14">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">pass</span></span>
<span id="cb5-15"></span>
<span id="cb5-16"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> Trainer(PrepareInputMixin, HFTrainer):</span>
<span id="cb5-17">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">pass</span></span></code></pre></div>
<p>The code should now run! (or, at least, you should now see a different bug!)</p>
</section>
</section>
<section id="afterword" class="level2">
<h2 class="anchored" data-anchor-id="afterword">Afterword</h2>
<p>Only after I fixed this bug did I realize that it was totally preventable: an even better way to train RNNs is to do the packing (and unpacking) of tensors within the model’s <code>forward</code> method. This has a few advantages: it’s more compatible with Huggingface’s API (you can, for example, sum <code>attention_mask</code>’s to infer sequence lengths, or add an <code>input_lengths</code> argument), and it also makes embedding and encoder-decoder structures more intuitive. So, something like the following:</p>
<div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb6-1"><span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">class</span> RecurrentEncoder(PreTrainedModel):</span>
<span id="cb6-2">    config_class <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> RecurrentEncoderConfig</span>
<span id="cb6-3"></span>
<span id="cb6-4">    ... other methods ...</span>
<span id="cb6-5"></span>
<span id="cb6-6">    <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">def</span> forward(</span>
<span id="cb6-7">        <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>,</span>
<span id="cb6-8">        input_ids: torch.LongTensor,</span>
<span id="cb6-9">        input_lengths: Optional[torch.LongTensor],</span>
<span id="cb6-10">        return_hidden_states: <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">bool</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>,</span>
<span id="cb6-11">    ) <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-&gt;</span> BaseModelOutputWithNoAttention:</span>
<span id="cb6-12"></span>
<span id="cb6-13">        embedded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.embedding(input_ids)</span>
<span id="cb6-14"></span>
<span id="cb6-15">        packed_embedded <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.utils.rnn.pack_padded_sequence(</span>
<span id="cb6-16">            embedded, input_lengths, batch_first<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, enforce_sorted<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span></span>
<span id="cb6-17">        )</span>
<span id="cb6-18"></span>
<span id="cb6-19">        packed_output, hidden <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">self</span>.recurrent_unit(packed_embedded)</span>
<span id="cb6-20"></span>
<span id="cb6-21">        hidden_states, _ <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> torch.nn.utils.rnn.pad_packed_sequence(packed_output, batch_first<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>, padding_value<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0</span>)</span>
<span id="cb6-22"></span>
<span id="cb6-23">        <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">return</span> BaseModelOutputWithNoAttention(</span>
<span id="cb6-24">            hidden_states<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>hidden_states,</span>
<span id="cb6-25">        )</span></code></pre></div>
<p>I should probably tidy up and release the recurrent models I wrote at some point.</p>
</section>
<section id="credits" class="level2">
<h2 class="anchored" data-anchor-id="credits">Credits</h2>
<p>Thumbnail image: <a href="https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks">Stanford CS 230</a><br>
<code>PackedSequence</code>’s: This <a href="https://gist.github.com/HarshTrivedi/f4e7293e941b17d19058f6fb90ab0fec">Github demo</a>, and this <a href="https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch">Stackoverflow answer</a></p>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>In addition, I liked this <a href="https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch">StackOverflow answer</a> explaining how it works.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>code</category>
  <category>huggingface</category>
  <category>research</category>
  <guid>https://enyanz.com/posts/trainer-packed-sequence/</guid>
  <pubDate>Fri, 31 Jan 2025 05:00:00 GMT</pubDate>
  <media:content url="https://enyanz.com/posts/trainer-packed-sequence/rnn.png" medium="image" type="image/png" height="38" width="144"/>
</item>
</channel>
</rss>
