ICYMI — March 2024

Recap of my Data and AI posts from the past month

Drew Seewald

Let’s wrap up the month with a roundup of what’s caught my eye in the data, AI, and machine learning space, and unpack what’s been on my mind.

Python is always the best, except when it’s not

Python is always faster! Saving those 5 ms of processing time isn’t worth having code no one on my team understands.

Polars is always faster! My data isn’t big enough for the speedup to justify learning a new package.

Scala is always faster! Maybe Scala code goes through fewer conversion steps on its way to Spark’s actual execution plan, but Spark SQL compiles down to the same execution plan anyway.

Mac is always faster! Well, I can’t for the life of me figure out how to do simple tasks on one, so it’s always going to be slower for me.

Give it a rest; not everyone needs to use the same tools. The ones you already have and know how to use can still get the job done just fine.

Variance in AI RAG Model Performance

3% accuracy from LLaMA-2-7B on benchmark questions?!

It sounds unreal, but how you format few-shot examples during in-context learning can lead to huge spreads in answer accuracy. As shown in a recent paper from the University of Washington, formatting choices as simple as spaces, newlines, and colons can produce accuracy spreads of up to 76 points (with lows like the 3% I mentioned). These prompts look the same to you and me, but they can be the difference between a model answering a question correctly and failing miserably.
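
To make that concrete, here’s a minimal sketch (my own illustration, not the paper’s FormatSpread tooling) of the kinds of trivially different formats in play; every template reads the same to a human:

```python
# Four few-shot templates that differ only in separators and spacing.
# To a human they're interchangeable; per the paper, a model's accuracy
# can swing wildly between them.
example = {"question": "What is 2 + 2?", "answer": "4"}

templates = [
    "Q: {question}\nA: {answer}",    # colon + newline
    "Q: {question} A: {answer}",     # colon + space
    "Q:: {question}\nA:: {answer}",  # doubled colons
    "Q:{question}\nA:{answer}",      # no space after the colon
]

for template in templates:
    print(template.format(**example))
    print("-" * 20)
```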

What’s even more interesting is that this sensitivity to few-shot example formatting shows up in every LLM tested, from GPT-3.5 to LLaMA-2. Even worse? The sensitivity persists when you increase model size, add more few-shot examples, or apply instruction tuning.

There isn’t a universally good format for few-shot examples either. The researchers showed that a format performing well on one model says little about how it will perform on another. Could we game this, though? Maybe: if you know which formats a model performs consistently well on, you can stick to those and will likely see good results.

What should we do now that we know this? Well, the researchers suggest that instead of reporting a single accuracy score on the common benchmarks, ranges could be presented instead. Who cares if the next GPT-5 posts an even better benchmark score if you aren’t formatting your few-shot examples the same way? How is that a fair comparison to other models? Maybe this is something Hugging Face could find a way to integrate into their leaderboards…

You can check out the paper here: https://arxiv.org/abs/2310.11324

The code is also released on GitHub: https://github.com/msclar/formatspread

Developers are really good translators

They take vague requirements from stakeholders and convert them into the strict, structured form that programming requires.

Some tasks are super easy and well defined. Finding an exhaustive list of valid next chess moves or loading a file in Python both have very clear acceptance criteria, and they don’t require many steps to complete correctly. These are ideal tasks to hand to AI, but large language models (LLMs) aren’t up to replacing developers today or anytime soon. Real software is so much more complicated.
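
To show what “clear acceptance criteria” means, here’s a minimal sketch of the file-loading task (the file name is made up):

```python
from pathlib import Path

def load_text(path: str) -> str:
    """Well-defined task: return the file's contents as a string."""
    return Path(path).read_text(encoding="utf-8")

# The acceptance criteria are short and checkable.
Path("notes.txt").write_text("hello", encoding="utf-8")
assert load_text("notes.txt") == "hello"
```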

As a data scientist, I spend a surprisingly small amount of time actually coding and doing heads-down work. Almost every request in my field is unclear and underdefined, so much of the work is figuring out how to define metrics, where to find data, and what the problem actually looks like when it’s solved. Sometimes it even means going back to your stakeholder to let them know their request isn’t going to be possible.

Most problems are poorly defined, and AI models are pretty bad at knowing what they don’t know. I don’t care how cool the devin.ai demos look; AI isn’t going to steal your job anytime soon.

What can help you stay competitive today? Use the tools at your disposal where it makes sense to do so. Need to code a feature in your project? There’s a whole range of models that can help you get that done quickly.

Check out the Stack Overflow blog post that inspired this section. I really enjoy their content, and it always gets me thinking about what the next big thing might be.

Microsoft AdaptivePaste

AI is great at generating sample code. I love being able to type plain language into GPT-4 and get back a nice code block that I can drop right into my Python notebook. But usually I’ll get a message after the code block telling me to “replace col1, col2, category, number with the actual column names in your DataFrame.”

Well, it turns out some researchers at Microsoft came up with a way to automatically identify and replace the variables in copied code (like from ChatGPT or Stack Overflow) with the correct variables already in your code.

The researchers deployed their method as a plugin that was surprisingly good at identifying and replacing variables in copied code. AdaptivePaste can be trained to adapt source code with 79.8% accuracy! Even more importantly, AdaptivePaste saved nearly 4 minutes versus human developers on some tasks.
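
To be clear, AdaptivePaste itself uses a learned model; the toy sketch below is my own and only shows the flavor of the final rename step, mapping placeholder names in pasted code onto variables that already exist in yours:

```python
import ast

def rename_identifiers(pasted_src: str, mapping: dict[str, str]) -> str:
    """Naively rewrite placeholder variable names in a pasted snippet."""
    tree = ast.parse(pasted_src)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id in mapping:
            node.id = mapping[node.id]
    return ast.unparse(tree)  # requires Python 3.9+

pasted = "result = df[df['col1'] == category]"
print(rename_identifiers(pasted, {"df": "sales_df", "category": "target_region"}))
# result = sales_df[sales_df['col1'] == target_region]
```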

Microsoft Blog on AdaptivePaste: https://www.microsoft.com/en-us/research/blog/microsoft-at-esec-fse-2023-ai-techniques-for-a-streamlined-coding-workflow/

Check out the pre-publication paper here: https://arxiv.org/abs/2205.11023

Daylight Saving Time

Unfortunately, March still includes a time change for some of us in the form of daylight saving time. Daylight saving time also happens to be the bane of my existence.

Dates and times feel inconsequential 99% of the time, yet no other small thing has such a monumental impact on my programming and data processes.

So, friendly reminder: if you have to handle daylight saving time this weekend, double-check your code before you push to production, otherwise you might be springing forward even earlier than you hoped.
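
Here’s a minimal sketch of the classic spring-forward gotcha using the standard library’s zoneinfo (US Central time is just an example):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

central = ZoneInfo("America/Chicago")

# US spring-forward in 2024: 2:00 AM on March 10 jumps straight to 3:00 AM.
before = datetime(2024, 3, 10, 1, 30, tzinfo=central)

# Wall-clock arithmetic lands on a time that never existed:
print((before + timedelta(hours=1)).isoformat())
# 2024-03-10T02:30:00-06:00

# Doing the arithmetic in UTC respects the jump:
utc_route = before.astimezone(ZoneInfo("UTC")) + timedelta(hours=1)
print(utc_route.astimezone(central).isoformat())
# 2024-03-10T03:30:00-05:00
```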

My favorite resources for handling dates and times:

  1. Python datetime documentation: https://docs.python.org/3/library/datetime.html
  2. Handling datetimes in R with lubridate: https://rstudio.github.io/cheatsheets/html/lubridate.html

As a bonus, I also gathered some of my thoughts on handling the basics of dates and times in Python in the article below.

Should you Code “Easy” Tasks?

If a coding task is easy, does that mean you shouldn’t create a method or attribute for it?

I saw a GitHub issue on the Python project where the maintainers politely said no to adding a very specific date format to the datetime library. It felt like a niche format, so I can see why they didn’t want it in the standard library.

At the end of the issue, the maintainer mentioned they might consider adding an attribute to help with the request, but since the code to get the answer was so simple, it might make more sense to just write it yourself.

Meanwhile, lubridate, an R package for handling dates and times, has functions for extracting just about every component of a datetime. Looking at the code, some of these functions are as simple as comparing the hour to see whether a time falls in the AM or PM.
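
For example, lubridate’s am() boils down to exactly that comparison; a rough Python equivalent might look like this:

```python
from datetime import datetime

def am(dt: datetime) -> bool:
    """Roughly what lubridate::am() does: is the time before noon?"""
    return dt.hour < 12

print(am(datetime(2024, 3, 15, 9, 30)))   # True
print(am(datetime(2024, 3, 15, 14, 0)))   # False
```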

They’re two different development philosophies, but I lean towards having more methods and attributes so I never have to remember or re-derive the calculation each time I need it.

Spaces or Tabs while Coding?

Spaces or tabs when coding?

It isn’t something I see come up too often, especially in the Python world, but it does from time to time. The argument is a lot like Databricks vs. dbt or pandas vs. Polars: the decision isn’t always yours to make, and the technical benefits aren’t its sole driver.

Take whatever language you work in. Does it support using both? Only one? Then the decision might be made for you.

What about your team? Do they use one or the other? What does the style guide say? Do you even have a preferred style guide on your team for writing code? You can’t force tabs if everyone on your team is into spaces.

One common argument is that tabs don’t render the same width across systems, while 4 spaces look the same everywhere. On the other hand, a tab is a single character, saving space that might be precious in your application. But if the team has agreed on spaces, these still might not be good enough reasons to switch.

Change is hard, and even if there could be benefits to it, it doesn’t always make sense. There’s no need to argue and spend time on how one is infinitely better than the other if the team is set in one way of doing things and it truly doesn’t matter that much at the end of the day.

Getting model results used is the hardest part of Data Science

Data projects aren’t done when the model is built and tuned.

We talk a lot about how difficult it is to get quality data and clean it up for modeling, and how that part of a data project can take 80% of the total time. But have you ever tried getting people to adopt your model and use it in their processes?

You can build the best model ever created that runs in a fraction of a second and costs nothing to run, but if you can’t get it deployed or incorporated into the business user’s processes, you haven’t finished the project.

Adoption is much easier if you start working with your stakeholders early on so they’re invested in the improvement your model provides. They need to buy into how you deploy it and the value it creates for them, so they can bring it to their team and make an actual difference.

AI watermarking

Watermarking AI content is a key step to fighting deepfakes and misinformation.

Hugging Face published a blog post last month detailing different methods for watermarking image, text, and audio content. It couldn’t have come at a better time: it’s harder than ever to know if what you’re looking at is real, and the proliferation of tools for creating AI-generated content makes watermarking a necessary part of the AI toolkit.

These tools provide a way to prevent data from being used to train more AI models, help identify AI-created content, and help document the provenance of digital media.

Some techniques are as simple as an indicator in an image: a few extra bits that can be read to say “I’m AI generated!” Others embed metadata about the image. Still others modify the image so that it looks normal to a human, but AI algorithms have trouble reading it properly.
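
As a deliberately naive illustration of the “few extra bits” idea (real watermarking schemes are far more robust and harder to strip than this):

```python
import numpy as np

def embed_flag(pixels: np.ndarray) -> np.ndarray:
    """Toy watermark: force the least significant bit of every red value to 1."""
    marked = pixels.copy()
    marked[..., 0] |= 1  # visually imperceptible change
    return marked

def has_flag(pixels: np.ndarray) -> bool:
    """Detect the toy watermark by checking those same bits."""
    return bool(np.all(pixels[..., 0] & 1))

image = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(has_flag(image))              # almost certainly False
print(has_flag(embed_flag(image)))  # True
```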

Watermarking isn’t always easy, though. There are methods for detecting AI-generated text, but they are so inconsistent that OpenAI shut down their tool for detecting ChatGPT output because of low accuracy.

Check out the post to learn more about the details of watermarking AI content: https://huggingface.co/blog/watermarking

Knowing the right tool for the job is critical for success

I recently saw a post from a Python developer whose client sent them data as text inside images. So they did the first thing that comes to every Python dev’s mind and used OCR (Tesseract) to extract the text.

They coded up the solution, did a bit of testing and tweaking to get it right, and eventually got the text out of the images without too much manual cleanup afterward.
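
For reference, the OCR route looks roughly like this, assuming the Tesseract binary plus the pytesseract and Pillow packages are installed (the file name is a stand-in):

```python
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (plus the Tesseract binary)

# "scan.png" stands in for one of the client's images.
text = pytesseract.image_to_string(Image.open("scan.png"))
print(text)
```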

So what’s the issue? Microsoft PowerToys includes a Text Extractor utility: press Win + Shift + T and you can select and extract any text on your screen. It works extremely well and is one of the tools in my arsenal for tackling problems. It’s incredibly easy to use, but if you didn’t know it existed, you’d never have known there was such an easy way to do it.

Text Extractor is available for free as part of Microsoft PowerToys: https://learn.microsoft.com/en-us/windows/powertoys/text-extractor

Test yourself!

I think we can all agree that courses, certifications, and even college degrees aren’t the only way to demonstrate your skills to employers. Projects, blogs, newsletters, posts, and videos are all great ways to show off what you’ve been working on and how you applied your skills.

…But that doesn’t mean that every certification is meaningless. In the data world, I put a bit more stock in certifications that expire eventually. Technology and best practices evolve quickly, so having to demonstrate the latest skills every once in a while makes sense. It does come at the cost of, well, paying a large corporation each time you test.

Speaking of testing and certifications, I just earned the Databricks Generative AI Fundamentals badge. Databricks does offer a compelling solution if your company is looking to build AI use cases, so it’s worth comparing to other offerings out there.

AI really changes the build vs buy conversation

A few years ago I would have gone with an off-the-shelf solution almost every time. There are so many options out there, and many vendors are very open to feedback or will work with you to add features. In many cases, someone out there is offering exactly what you need.

With the incredible advancements in large language models over the past year, I can see the build vs. buy conversation being very different. There are so many options for developer augmentation tools through copilots and chat agents.

Midjourney is a great example of how much can be accomplished with a small team. They are one of the top AI image generation platforms out there, but in October 2023 they reportedly had fewer than 100 full-time employees. A small team with the right tools can do incredible things.

Matryoshka Embeddings: Not Your Grandma’s Word Embeddings

Russian nesting doll word embeddings?

I was checking out the Hugging Face blog when I saw a post about Matryoshka (Russian nesting doll) embeddings. At a high level, they are regular embeddings that can be truncated to fewer dimensions. Depending on how much of the original embedding you keep, storage and retrieval can be sped up dramatically, even on large-scale, real-world datasets.

The best part is you don’t have to sacrifice much accuracy. The secret sauce is how the loss is configured during training: instead of computing the loss only on the full-size embedding, it is computed on several truncated prefixes as well, and the losses are summed and optimized together. This incentivizes the model to put the most important information at the front of the vector, so a truncated version retains more of it.
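
Mechanically, using a Matryoshka embedding is as simple as slicing off a prefix and re-normalizing. A minimal sketch (with random stand-in vectors, so the similarity numbers here are meaningless; with MRL-trained embeddings, the truncated similarity tracks the full one):

```python
import numpy as np

def shrink(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep only the first `dims` dimensions, then re-normalize."""
    head = vec[:dims]
    return head / np.linalg.norm(head)

# Stand-ins for two 768-dimensional Matryoshka-trained sentence embeddings.
rng = np.random.default_rng(0)
a = rng.standard_normal(768); a /= np.linalg.norm(a)
b = rng.standard_normal(768); b /= np.linalg.norm(b)

print("full similarity:", float(a @ b))
print("64-d similarity:", float(shrink(a, 64) @ shrink(b, 64)))
```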

Check out the blog post (and play around with the embeddings near the end): https://huggingface.co/blog/matryoshka

Check out the paper: https://arxiv.org/abs/2205.13147

Get the pretrained models: https://github.com/RAIVNLab/MRL

AI isn’t the end-all of increasing productivity

AI isn’t going to increase your productivity…

…by itself. Other changes are required to maximize productivity.

I was reading a blog post from Stack Overflow that had a lot of good points about where LLMs fit into the future of work and productivity. Right now, code generation (codegen) tools rarely get programmers to a 100% finished product. They’re excellent at generating code from requirements and rewriting example code for your specific use case, but the developer still needs to know what they want to accomplish for the AI to provide useful information.

I recently spent some time with someone who has excellent subject matter expertise but no Spark background. They wanted to do some feature engineering but didn’t know how to write the PySpark code to get the task done. They turned to ChatGPT and produced functioning code that accomplished the task. They didn’t have to wait for me to help them write anything, and I got to spend my day on something else. Productivity was up!
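
The code they ended up with looked something like this (a hypothetical reconstruction; the column names and features are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame(
    [("2024-03-01", 120.0), ("2024-03-02", 95.5)],
    ["order_date", "amount"],
)

features = (
    orders
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("order_month", F.month("order_date"))
    .withColumn("is_large_order", (F.col("amount") > 100).cast("int"))
)
features.show()
```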

AI codegen tools aren’t perfect, though. They have a habit of confidently providing valid-sounding information that is actually false; you’ll hear this called hallucination. Someone without much experience can’t tell hallucinated code from legitimate code, and even when it’s harmless (sometimes it isn’t!) they’ll struggle to correct the errors. Productivity is not so up.

I’ve seen a lot of news about Devin.ai. It’s a very cool concept, but it goes to show how far codegen tools still have to go before they help developers write quality code effectively. If you haven’t seen it, it takes a task from the user, then has access to code, a command line, a browser, and other resources, just like a developer would, and uses them to attempt a solution. It can solve problems and fix errors, but it isn’t that successful yet. Your job is safe for now, or at least until the coding robots evolve further.

Some other applications for codegen that were touched on in the blog post:

  • Codegen is great for writing unit tests… but you still need to know what to test for.
  • Codegen is great for documenting and explaining code… so you can understand what it does and how that will impact new features and enhancements.

I highly recommend going to read the full article for yourself on the Stack Overflow blog: https://stackoverflow.blog/2023/10/16/is-ai-enough-to-increase-your-productivity/

Also check out the Stack Overflow Podcast. They are always talking about interesting things in this space and it’s short, sweet, and to the point: https://stackoverflow.blog/podcast

How can you write code that explains itself?

Could you write code that explains itself?

With programming, there are so many ways to reach the same outcome. Sometimes one way is only subjectively better, but other times there’s an obviously superior way to complete a task.

I’ve seen many people extract elements of a Spark datetime using the substring function. It works, but what on earth does a substring starting at the 1st character and running 4 characters long mean?

I like to use functions that describe what they are doing. That way, the next time I look back and need to know if the feature I just created is the month or year of a datetime, the function tells me straight up which it is.

Check out the examples below. The two result variables should get the same output, but only the second uses functions purpose-built to explain themselves.
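
The original screenshot isn’t reproduced here, so below is a sketch of the same comparison (the column name is made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = (
    spark.createDataFrame([("2024-03-10 14:30:00",)], ["event_ts"])
         .withColumn("event_ts", F.to_timestamp("event_ts"))
)

# Option 1: works, but "position 1, length 4" forces the reader to stop and think.
result_substring = df.select(
    F.substring(F.col("event_ts").cast("string"), 1, 4).alias("year")
)

# Option 2: the function name says exactly what you're getting.
result_year = df.select(F.year("event_ts").alias("year"))

result_substring.show()  # 2024 (as a string)
result_year.show()       # 2024 (as an integer)
```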

That’s all for March! If you’re looking for more consolidated content like this, be sure to follow me for a monthly download of what I’ve been looking at.
