
Ding. Sidekick leveled up.

Written by Matt Barr | Sep 24, 2024 3:18:39 PM

The Highlights

You've probably noticed some big improvements if you've been using our smart data assistant, Sidekick. Here's what’s new:

  • 2x faster responses
  • More reliable results with fewer corrections
  • Smarter search with optimizations that handle even the trickiest queries

On top of that, we made a couple of nice updates to the experience. Sidekick now shares its plan for complex requests, giving you insight into its process. This helps you spot opportunities for further analysis and makes it easier to clarify issues if they arise. Sidekick’s access to our data catalog prevents factual errors; it occasionally introduced minor 'glitches' into repetitive text structures in the past, but we haven't observed that issue since the update. It also ties in better than ever to your existing analysis and publishing workflows.

A 40-second Demo

As we've said before, you're just a good question away from insights with Sidekick. In this demo, that question is a complex one:

I'd like to compare alcohol consumption to life expectancy in cities within Jefferson County, KY. Can you show me a bivariate map of these items and tell me which place has the highest alcohol consumption and which has the lowest life expectancy?

Watch Sidekick as it outlines its plan, executes searches, and performs three unique analyses:

  • a bivariate map
  • a ranking based on alcohol consumption
  • a ranking based on life expectancy

Not only does Sidekick show us the visualization and underlying data for each analysis, it also retrieves the appropriate data and surfaces analysis-specific stats and insights. After that, the results are synthesized into a summary, and the question is answered (correctly!). The geographies, data, and visualizations are all available to browse, double-check, reuse, and share to power up your workflow. In this demo, we end by doing just that: browsing the visualizations and sharing the bivariate map.
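
For the curious, the shape of that flow is roughly "plan, execute each step, synthesize." Here's a minimal sketch in Python; every name in it is hypothetical and purely illustrative, not Sidekick's actual internals:

    from dataclasses import dataclass

    # Hypothetical plan-execute-synthesize flow; all names are illustrative.

    @dataclass
    class Step:
        kind: str    # e.g. "bivariate_map" or "ranking"
        target: str  # the variable(s) the step analyzes

    def outline_plan(question: str) -> list[Step]:
        # Stand-in for the model-generated plan Sidekick shares with you.
        return [
            Step("bivariate_map", "alcohol consumption x life expectancy"),
            Step("ranking", "alcohol consumption"),
            Step("ranking", "life expectancy"),
        ]

    def run_step(step: Step) -> str:
        # Stand-in for geography/data search plus the analysis itself,
        # including per-analysis stats and insights.
        return f"ran {step.kind} on {step.target}"

    def answer(question: str) -> str:
        results = [run_step(step) for step in outline_plan(question)]
        # Stand-in for synthesizing the results into a final summary.
        return " | ".join(results)

    print(answer("Compare alcohol consumption to life expectancy"))

The real system is, of course, far more involved; the point is only that the plan is explicit and each analysis is executed and summarized as its own step.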


The demo was fast, accurate, and thorough, and it integrated a publishing workflow at the end. That's why we believe everybody needs a Sidekick.


The Nerdier Details

We're continuing to reuse and build upon the evals we've discussed previously, but Sidekick has become so accurate that analyzing the results has taken on a different character. From a top-line perspective:

  • We use automation and AI judges to scale testing; there are now so few "negative" (failing) test results that we can manually review every single one
  • We found no 'false positives' (tests that incorrectly passed) but did identify around 15 false negatives (tests that failed but should have passed) on "Answer Correctness" and ~7 for "Faithfulness"
  • Correcting the false negatives for "Faithfulness" reveals that Sidekick is now 100% faithful to the data in its catalog across repeated runs of our baseline evals (a minimal sketch of this correction follows this list)
  • Its performance on multi-step analysis questions (like the demo) improved by 2.5x, on simple questions by 3.7x, and on data search by 1.25x; it now performs perfectly on geography search and on understanding what data is available in the catalog
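
To make that false-negative correction concrete, here's a minimal sketch of how a manually reviewed override changes a metric's pass rate. The field names are hypothetical and don't reflect our actual harness:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class JudgedResult:
        metric: str                            # e.g. "faithfulness"
        judge_passed: bool                     # the AI judge's verdict
        human_override: Optional[bool] = None  # set during manual review of failures

    def pass_rate(results: list[JudgedResult], metric: str) -> float:
        # A human override, when present, supersedes the judge's verdict.
        relevant = [r for r in results if r.metric == metric]
        passed = sum(
            r.human_override if r.human_override is not None else r.judge_passed
            for r in relevant
        )
        return passed / len(relevant)

    # One true pass plus one judge failure that review identified as a false negative:
    results = [JudgedResult("faithfulness", True),
               JudgedResult("faithfulness", False, human_override=True)]
    print(pass_rate(results, "faithfulness"))  # 1.0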

In a previous post, we detailed how Sidekick answered a large range of evaluation questions faithfully, with an error rate 30x better than ChatGPT reading from a small-to-medium table and 5x better than the state-of-the-art "SpreadsheetLLM" technique introduced by Microsoft AI Research. It does this while accessing a data catalog ~8 million times the size of the data SpreadsheetLLM was tested on. In our updated baseline, Sidekick is 100% faithful to the data in the catalog. This was an unexpected result, and it broke the "X times better than ChatGPT/SpreadsheetLLM" comparison we were using before (by that calculation, it would now be "infinitely better").

So, what do these evals look like? Frankly, they look a lot like the demo above (only automated and evaluated by an AI judge). That demo was taken directly from our "sidekick-analysis-suite." The question appears verbatim in the test, followed by some additional instructions for the AI judges:

Expected Answer

Highest Alcohol Consumption occurs in Meadowbrook Farm and Parkway Village (~18%), while Lowest Life Expectancy is observed in Hollyvilla (~72 years).

Extra Instructions

Ensure that no errors were reported in generating the map and that the message doesn't contain ANY placeholder for the visualization (e.g. a link, markdown, etc.)
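
Assembled into one record, a sample might look roughly like this; the field names are illustrative, not our actual schema:

    # A hypothetical shape for one sample in the "sidekick-analysis-suite".
    sample = {
        "question": (
            "I'd like to compare alcohol consumption to life expectancy in cities "
            "within Jefferson County, KY. Can you show me a bivariate map of these "
            "items and tell me which place has the highest alcohol consumption and "
            "which has the lowest life expectancy?"
        ),
        "expected_answer": (
            "Highest Alcohol Consumption occurs in Meadowbrook Farm and Parkway "
            "Village (~18%), while Lowest Life Expectancy is observed in Hollyvilla "
            "(~72 years)."
        ),
        "extra_instructions": (
            "Ensure that no errors were reported in generating the map and that "
            "the message doesn't contain ANY placeholder for the visualization "
            "(e.g. a link, markdown, etc.)"
        ),
        # Configuration data, a unique ID, and a friendly description omitted,
        # as noted below.
    }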

The automation system runs the conversation based on that input question. Then, the AI judges use a systematic approach to break down the question, expected answer, actual answer, and any additional tools/context used, producing metrics on correctness, faithfulness, relevance, and adherence to those extra instructions. I omitted some configuration data, a unique ID, and a friendly description for brevity, but that's it. That's one of hundreds of evaluation samples we use to build and optimize Sidekick. The only other item to note is that our evaluations lean toward the difficult side: we use a mixture of straightforward, analytical, and hard/tricky questions (often collected from past mistakes) that are harder than the distribution of questions Sidekick fields every day, so it's ready for action when you've got something more difficult in mind!
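
As a rough illustration of that judging step, here's a sketch of an LLM-as-judge loop. The prompt wording is invented, and call_llm stands in for whichever model client you like; none of this is our production code:

    from typing import Callable

    METRICS = ["correctness", "faithfulness", "relevance", "instruction_adherence"]

    def judge(question: str, expected: str, actual: str, context: str,
              call_llm: Callable[[str], str]) -> dict[str, bool]:
        # One pass/fail verdict per metric, rendered by an AI judge.
        scores = {}
        for metric in METRICS:
            prompt = (
                f"You are grading a data assistant on {metric}.\n"
                f"Question: {question}\n"
                f"Expected answer: {expected}\n"
                f"Actual answer: {actual}\n"
                f"Tools/context used: {context}\n"
                "Reply PASS or FAIL, followed by a one-line reason."
            )
            verdict = call_llm(prompt)
            scores[metric] = verdict.strip().upper().startswith("PASS")
        return scores

In practice, a verdict of PASS on every metric marks the sample green; anything else lands in the small pile of failures we review by hand.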

Want to experience Sidekick’s state-of-the-art data analysis and see how we push its performance? Let’s connect!